Abstract: This paper shows that the Rényi information dimension
(RID) of an i.i.d. sequence of mixture random variables
polarizes to the extremal values of 0 and 1 (fully discrete and
fully continuous distributions) when transformed by a Hadamard
matrix. This provides a natural counterpart over the reals
of the entropy polarization phenomenon over finite fields. It
is further shown that the polarization pattern of the RID is
equivalent to the BEC polarization pattern, which admits a
closed-form expression. These results are used to construct universal
and deterministic partial Hadamard matrices for analog to
analog (A2A) compression of random i.i.d. signals. In addition,
a framework for the A2A compression of multiple correlated
signals is developed, providing a first counterpart of the
Slepian-Wolf coding problem in the A2A setting.

Index Terms: Rényi information dimension, Polarization,
Information preserving matrices, Analog compression, Distributed
analog compression, Compressed sensing.



I. Introduction 
A. Analog to analog compression 

Analog to analog (A2A) compression of signals has recently
gathered interest in information theory [12]-[15]. In A2A
compression, a high dimensional analog signal $x^n \in \mathbb{R}^n$ is encoded
into a lower dimensional analog signal $y^m = f_n(x^n) \in \mathbb{R}^m$.
The goal is to design the encoding so as to preserve in $y^m$ all
the information about $x^n$, and to obtain successful decoding
for a given distortion measure like MSE or error probability.
In particular, the encoding may be corrupted by noise. It is
worth mentioning that when the alphabet of $x$ and $y$ is finite,
this framework falls into traditional topics of information
theory such as lossless and lossy data compression, or joint
source-channel coding. The novelty of A2A compression is to
consider $x$ and $y$ to be real valued and to impose regularity
constraints on the encoder, in particular linearity, as motivated
by compressed sensing [1], [2].

The challenge and practicality of A2A compression is to
obtain dimensionality reduction, i.e., $m/n \ll 1$, by exploiting
prior knowledge on the signal. This may be sparsity as in
compressed sensing. For $k$-sparse signals, and without any
stability or complexity considerations, it is not hard to see that the
dimensionality reduction can be of order $k/n$. A measurement
rate of order $(k/n)\log(n/k)$ has been shown to be sufficient
to obtain stable recovery by solving tractable optimization
problems like convex programming ($\ell_1$ minimization). This
remarkable achievement has gathered a tremendous amount of
attention, with a large variety of algorithmic solutions deployed
over the past years. The vast majority of the research has
however capitalized on a common sparsity model.

Several works have explored connections between information
theory and compressed sensing$^1$, in particular [6]-[11];
however, it is only recently [12] that a foundation of
A2A compression has been developed, shifting the attention
to probabilistic signal models beyond the sparsity structure.
It is shown in [12] that under linear encoding and
Lipschitz-continuous decoding, the fundamental limit of A2A
compression is the Rényi information dimension (RID), a measure
whose operational meaning had remained marginal in
information theory until [12]. In the case of a nonsingular mixture
distribution, the RID is given by the mass on the continuous
part, and for the specific case of sparse mixture distributions,
this gives a dimensionality reduction of order $k/n$. It is natural
to ask whether this improvement on compressed sensing is
due to potentially complex or non-robust coding strategies.
[13] shows that robustness to noise is not a limitation of the
framework in [12]. Two other works [14], [15] have
corroborated the fact that complexity may not be a limitation either.
In [14], spatially-coupled matrices are used for the encoding
of the signal, leveraging the analytical ground of
spatially-coupled codes and predictions of [17]. In particular, [14] shows
that the RID is achieved using an approximate message passing
algorithm with block diagonal Gaussian measurement matrices.
However, the size of the blocks increases as the measurement
rate approaches the RID. In [15], using a new entropy power
inequality (EPI) for integer-valued random variables that was
further developed in [16], the polarization technique was used
to deterministically construct partial Hadamard matrices for
encoding discrete signals over the reals. This provides a way to
achieve a measurement rate of $o(n)$ for signals with a zero RID,
along with a stable low complexity recovery algorithm. The case
of mixture distributions was however left open in [15].

This paper proposes a new approach to A2A compression
by means of a polarization theory over the reals. The use of
polarization techniques for sparse recovery was proposed in
[18] for discrete signals, relying on coding strategies over finite
fields. In this paper, it is shown that using the RID, one obtains
a natural counterpart over the reals of the entropy polarization
phenomenon [19], [20]. Specifically, the entropy (or source)
polarization phenomenon [20] shows that transforming an i.i.d.
sequence of discrete random variables using a Hadamard
matrix polarizes the conditional entropies to the extreme values
of 0 and 1 (deterministic and maximally random distributions).
We show in this paper that the RID of an i.i.d. sequence of
mixture random variables also polarizes to the two extreme
values 0 and 1 (discrete and continuous distributions). To get to
this result, properties of the RID in vector settings and related
information measures are first developed. It is then shown that
the RID polarization is, as opposed to the entropy polarization,
obtained with an analytical pattern. In other words, there is no
need to rely on algorithms to compute the set of components
which tend to 0 or 1, as this is given by a known pattern
equivalent to the BEC channel polarization [19]. This is then
used to obtain universal A2A compression schemes based on
explicit partial Hadamard matrices. The current paper focuses
on the encoding strategies and on extracting the RID without
specifying the decoding strategy. Numerical simulations
provide evidence that efficient message passing algorithms may
be used in conjunction with the obtained encoders.

$^1$ [3]-[5] investigate LDPC coding techniques for compressed sensing.

Finally, the paper extends the realm of A2A compression to
multi-signal settings. Techniques of distributed compressed
sensing were introduced in [23] for specific classes of sparse
signal models. We provide here an information theoretic
framework for general multi-signal A2A compression, as a
counterpart of the Slepian & Wolf coding problem in source
compression [24]. A measurement rate region to extract the
RID of correlated signals is obtained and is shown to be tight.

B. Notations and preliminaries 

The set of reals, integers and positive integers will be
denoted by $\mathbb{R}$, $\mathbb{Z}$ and $\mathbb{Z}_+$ respectively. $\mathbb{N} = \mathbb{Z}_+ \setminus \{0\}$ will
denote the set of strictly positive integers. For $n \in \mathbb{N}$,
$[n] = \{1, 2, \ldots, n\}$ denotes the sequence of integers from 1
to $n$. For a set $S$, the cardinality of the set will be denoted by
$|S|$, thus $|[n]| = n$.

All random variables are denoted by capital letters and their
realizations by lower case letters ($x$ is a realization of the random
variable $X$). The expected value and the variance of a random
variable $X$ are denoted by $\mathbb{E}\{X\}$ and $\sigma_X^2$. For $i, j \in \mathbb{Z}$,
$X_i^j$ is a column vector consisting of the random variables
$\{X_i, X_{i+1}, \ldots, X_j\}$ and for $i > j$, we set $X_i^j$ equal to null.

For a discrete random variable $X$ with a distribution $p_X$,
$H(X) = H(p_X)$ denotes the discrete entropy of $X$. For
the continuous case, $h(X) = h(p_X)$ denotes the differential
entropy of $X$. Throughout the paper, we assume that all
discrete and continuous random variables have well-defined
discrete entropy and differential entropy respectively. For
random elements $X$, $Y$ and $Z$, $I(X; Y)$ and $I(X; Y | Z)$ denote
the mutual information of $X$ and $Y$ and the conditional mutual
information of $X$ and $Y$ given $Z$. $I(X; Y | z)$ denotes the
mutual information of $X$ and $Y$ given a specific realization
$Z = z$. Hence, $I(X; Y | Z) = \mathbb{E}_Z\{I(X; Y | z)\}$. For simplicity,
we also assume that all of the random variables (discrete,
continuous or mixture) have finite second order moments.

All probability distributions are assumed to be nonsingular.
Hence, in the general case for a random variable $X$, the
distribution of $X$ can be decomposed as $p_X = \delta p_c + (1 - \delta) p_d$,
where $p_c$ and $p_d$ are the continuous and the discrete part
of the distribution and $0 \le \delta \le 1$ is the weight of the
continuous part. Thus, $\delta = 0$ and $\delta = 1$ correspond to the
fully discrete and fully continuous case respectively. For such
a probability distribution, the Rényi information dimension is
interchangeably denoted by $d(p_X)$ or $d(X)$ and is equal to the
weight of the continuous part $\delta$.

There is another representation for a random variable $X$ that
we will repeatedly use in the paper. Assume $U$ is a continuous
random variable with probability distribution $p_c$ and $V$ is a
discrete random variable with probability distribution $p_d$, and
$U$ and $V$ are independent. Let $\Theta \in \{0, 1\}$ be a binary valued
random variable, independent of $U$ and $V$, with $\mathbb{P}(\Theta = 1) = \delta$.
It is easy to see that we can represent $X$ as $X = \Theta U + \bar{\Theta} V$,
where $\bar{\Theta} = 1 - \Theta$. In this case, the random variable $X$ will
have the distribution $p_X = \delta p_c + (1 - \delta) p_d$. Also, if $X_1^n$ is
a sequence of such random variables with the corresponding
binary random variables $\Theta_1^n$, $C_\Theta = \{i \in [n] : \Theta_i = 1\}$
is a random set consisting of the positions of the continuous
components of the signal. Similarly, $\bar{C}_\Theta = [n] \setminus C_\Theta$ is defined
to be the position of the discrete components.
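
As an illustration of the representation of $X$ through $U$, $V$ and the binary selector, the following minimal Python sketch draws samples of a mixture variable and records the positions of the continuous components. The choice of a standard Gaussian for the continuous part, of the atoms $\{-1, 0, 1\}$ for the discrete part, and the function name `sample_mixture` are assumptions made only for this example.

```python
import random

def sample_mixture(delta, sample_continuous, sample_discrete, rng):
    """Draw one sample of X = Theta*U + (1 - Theta)*V with P(Theta = 1) = delta."""
    theta = 1 if rng.random() < delta else 0
    x = sample_continuous(rng) if theta == 1 else sample_discrete(rng)
    return x, theta

rng = random.Random(0)
delta = 0.3  # weight of the continuous part (assumed value for the example)

# Assumed example parts: U standard Gaussian, V uniform on the atoms {-1, 0, 1}.
draws = [
    sample_mixture(delta,
                   lambda r: r.gauss(0.0, 1.0),
                   lambda r: r.choice([-1, 0, 1]),
                   rng)
    for _ in range(20000)
]

# Random set of positions of the continuous components (0-based here).
continuous_positions = [i for i, (_, theta) in enumerate(draws) if theta == 1]
fraction_continuous = len(continuous_positions) / len(draws)
```

The empirical fraction of continuous positions should be close to the weight of the continuous part, which is the value the text identifies with the RID of the mixture.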

For a matrix $\Phi$ of a given dimension $m \times n$ and a set
$S \subseteq [n]$, $\Phi_S$ is a sub-matrix of dimension $m \times |S|$ consisting of
those columns of $\Phi$ having index in $S$. Similarly, for a vector
of random variables $X_1^n$, the vector $X_S = \{X_i : i \in S\}$ is a
sub-vector of $X_1^n$ consisting of those random variables having
index in $S$. For two matrices $A$ and $B$ of dimensions $m_1 \times n$
and $m_2 \times n$, $[A; B]$ denotes the $(m_1 + m_2) \times n$ matrix obtained
by vertically concatenating $A$ and $B$.

For $x \in \mathbb{R}$ and $q \in \mathbb{N}$, $[x]_q = \lfloor qx \rfloor / q$ denotes the uniform
quantization of $x$ with spacing $1/q$. Similarly, for a vector of
random variables $X_1^n$, $[X_1^n]_q$ will denote the component-wise
uniform quantization of $X_1^n$.

For $a(q)$ and $b(q)$ two functions of $q$, $a(q) \preceq b(q)$ or
equivalently $b(q) \succeq a(q)$ will be used for



b(q) - a(q) 
Similarly, $a(q) \doteq b(q)$ is equivalent to $a(q) \preceq b(q)$ and
$a(q) \succeq b(q)$.

An ensemble of single terminal measurement matrices will
be denoted by $\{\Phi_N\}$, where $N$ is the labeling sequence and
can be any subsequence of $\mathbb{N}$. The dimension of the family
will be denoted by $m_N \times N$, where $m_N$ is the number of
measurements taken by $\Phi_N$. The asymptotic measurement
rate of the ensemble is defined by $\limsup_{N \to \infty} m_N / N$. We will
also work with an ensemble of multi terminal measurement
matrices. We will focus on the two terminal case; the
extension to more than two terminals is straightforward.
We will denote these two terminals by $x, y$ and the
corresponding ensemble by $\{\Phi_N^x, \Phi_N^y\}$ with the corresponding
dimensions $m_N^x \times N$ and $m_N^y \times N$. The measurement rate
vector for this ensemble will be denoted by $(\rho_x, \rho_y)$, where

$$\rho_x = \limsup_{N \to \infty} \frac{m_N^x}{N}, \qquad \rho_y = \limsup_{N \to \infty} \frac{m_N^y}{N}.$$



II. Renyi information dimension 

Let $X$ be a random variable with a probability distribution
$p_X$ over $\mathbb{R}$. The upper and the lower RID of this random
variable are defined as follows:

$$\bar{d}(X) = \limsup_{q \to \infty} \frac{H([X]_q)}{\log_2(q)},$$

$$\underline{d}(X) = \liminf_{q \to \infty} \frac{H([X]_q)}{\log_2(q)}.$$

By the Lebesgue decomposition or Jordan decomposition theorem,
any probability distribution over $\mathbb{R}$ like $p_X$ can be written as
a convex combination of a discrete part, a continuous part and
a singular part, namely,

$$p_X = \alpha_d p_d + \alpha_c p_c + \alpha_s p_s,$$

where $p_d$, $p_c$ and $p_s$ denote the discrete, continuous and the
singular part of the distribution and $\alpha_d, \alpha_c, \alpha_s \ge 0$ and
$\alpha_d + \alpha_c + \alpha_s = 1$. In [27], Rényi showed that if $\alpha_s = 0$, namely,
there is no singular part in the distribution and
$p_X = (1 - \delta) p_d + \delta p_c$ for some $\delta \in [0, 1]$, then the RID is well-defined
and $d(X) = \bar{d}(X) = \underline{d}(X) = \delta$. Moreover, he proved that if
$X_1^n$ is a continuous random vector then

$$\lim_{q \to \infty} \frac{H([X_1^n]_q)}{\log_2(q)} = n,$$

implying the RID of $n$ for the $n$-dimensional continuous
random vector.
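
To make the definition concrete, the following Python sketch evaluates the ratio of quantized entropy to $\log_2(q)$ exactly for one simple mixture. The mixture (a point mass at 0 with weight $1-\delta$ and a uniform law on $[0,1)$ with weight $\delta$), the quantizer $\lfloor qx \rfloor / q$, and the function names are assumptions chosen for this example, since the cell probabilities are then available in closed form.

```python
import math

def quantized_entropy_bits(delta, q):
    """H([X]_q) in bits for X ~ Uniform[0,1) w.p. delta and X = 0 w.p. 1 - delta."""
    p_zero_cell = (1.0 - delta) + delta / q  # cell 0 holds the atom plus a uniform slice
    p_other_cell = delta / q                 # each of the remaining q - 1 cells
    h = 0.0
    if p_zero_cell > 0.0:
        h -= p_zero_cell * math.log2(p_zero_cell)
    if p_other_cell > 0.0:
        h -= (q - 1) * p_other_cell * math.log2(p_other_cell)
    return h

def rid_ratio(delta, q):
    """The ratio H([X]_q) / log2(q) whose limit defines the RID."""
    return quantized_entropy_bits(delta, q) / math.log2(q)

delta = 0.3
ratios = [rid_ratio(delta, 2 ** k) for k in (10, 20, 40, 60)]
```

For this example the ratio behaves like $\delta + h(\delta)/\log_2(q)$, with $h$ the binary entropy, so it approaches $\delta$ only logarithmically in $q$. This slow convergence is one reason a numerical estimate of the RID from samples is delicate.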

Our objective is to extend the definition of RID for arbitrary 
vector random variables, which are not necessarily continuous. 
To do so, we first restrict ourselves to a rich space of random 
variables with well-defined RID. Over this space, it will be 
possible to give a full characterization of the RID as we will 
see in a moment. 

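Definition 1. Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a standard probability space.
The space $\mathcal{L}(\Omega, \mathcal{F}, \mathbb{P})$ is defined as $\mathcal{L} = \bigcup_{n=1}^{\infty} \mathcal{L}_n$, where $\mathcal{L}_1$ is
the set of all nonsingular random variables and for $n \in \mathbb{N} \setminus \{1\}$,
$\mathcal{L}_n$ is the space of $n$-dimensional random vectors defined as

$$\mathcal{L}_n = \{X_1^n : \text{there exist } k \in \mathbb{N},\ A \in \mathbb{R}^{n \times k} \text{ and } Z_1^k \text{ independent and nonsingular such that } X_1^n = A Z_1^k\}.$$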

Remark 1. It is not difficult to see that all $n$-dimensional vector
random variables, singular or nonsingular, can be well
approximated in the space $\mathcal{L}$, for example in the $L_2$ sense. However, this
is not sufficient to fully characterize the RID. In particular, the
RID is discontinuous in the $L_p$ topology, $p \ge 1$. For example, we
can construct a sequence of fully discrete random variables
in $\mathcal{L}$ converging to a fully continuous random variable in $L_p$,
whereas the RID of the sequence is 0 and does not converge to
1. Although we have such a mathematical difficulty in giving
a characterization of the RID, we think that the space $\mathcal{L}$ is
rich enough for modeling most of the cases that we encounter
in applications.

Over $\mathcal{L}$, we will generalize the definition of the RID to
include joint RID, conditional RID and Rényi information,
defined as follows.



Definition 2. Let $X_1^n$ be a random vector in $\mathcal{L}$. The joint RID
of $X_1^n$, provided that it exists, is defined as

$$d(X_1^n) = \lim_{q \to \infty} \frac{H([X_1^n]_q)}{\log_2(q)}.$$

Definition 3. Let $(X_1^n, Y_1^m)$ be a random vector in $\mathcal{L}$. The
conditional RID of $X_1^n$ given $Y_1^m$ and the Rényi information of
$Y_1^m$ about $X_1^n$, provided they exist, are defined as follows:

$$d(X_1^n | Y_1^m) = \lim_{q \to \infty} \frac{H([X_1^n]_q | Y_1^m)}{\log_2(q)},$$

$$I_R(X_1^n; Y_1^m) = d(X_1^n) - d(X_1^n | Y_1^m).$$

Generally, it is difficult to give a characterization of the RID
for a general multi-dimensional distribution because it can
contain probability mass over complicated subsets or
sub-manifolds of lower dimension. However, we will show that
the vector Rényi information dimension is well-defined for
the space $\mathcal{L}$. In order to give the characterization of the RID over
$\mathcal{L}$, we also need to define some concepts from the linear algebra of
matrices, namely, for two matrices of appropriate dimensions,
we propose the following definition of the "influence" of one
matrix on another matrix and the "residual" of one matrix given
another matrix.

Definition 4. Let $A$ and $B$ be two arbitrary matrices of
dimension $m_1 \times n$ and $m_2 \times n$. Also let $K \subseteq [n]$. The influence
of the matrix $B$ on the matrix $A$ and the residual of the matrix
$A$ given $B$ over the column set $K$ are defined to be

$$I(A; B)[K] = \mathrm{rank}([A; B]_K) - \mathrm{rank}(A_K),$$

$$R(A; B)[K] = \mathrm{rank}([A; B]_K) - \mathrm{rank}(B_K).$$



Remark 2. It is easy to check that $I(A; B)[K]$ is the amount
of increase of the rank of the matrix $A_K$ by adding rows of
the matrix $B_K$ and $R(A; B)[K]$ is the residual rank of the
matrix $A_K$ knowing the rows of the matrix $B_K$. Moreover,
one can easily check that $I(A; B)[K] = R(B; A)[K]$.
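
The influence and the residual of Definition 4 only involve ranks of column sub-matrices, so they are easy to compute. The sketch below is a minimal Python illustration that represents matrices as lists of rows and uses exact Gaussian elimination over the rationals; the helper names (`rank`, `columns`, `influence`, `residual`) and the example matrices are assumptions introduced for this example.

```python
from fractions import Fraction

def rank(rows):
    """Exact rank of a matrix given as a list of rows (Gaussian elimination over Q)."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    r = 0
    for c in range(n_cols):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][c] / m[r][c]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r

def columns(matrix, K):
    """Sub-matrix keeping the columns whose 1-based index lies in K."""
    return [[row[j - 1] for j in sorted(K)] for row in matrix]

def influence(A, B, K):
    """I(A;B)[K] = rank([A;B]_K) - rank(A_K); A + B stacks the rows vertically."""
    return rank(columns(A + B, K)) - rank(columns(A, K))

def residual(A, B, K):
    """R(A;B)[K] = rank([A;B]_K) - rank(B_K)."""
    return rank(columns(A + B, K)) - rank(columns(B, K))

# Example matrices (assumed for illustration): rows of the 4 x 4 Hadamard matrix.
A = [[1, 1, 1, 1], [1, -1, 1, -1]]
B = [[1, 1, -1, -1]]
example_influence = influence(A, B, {1, 2, 3})
example_residual = residual(A, B, {1, 2, 3})
```

On the column set $\{1, 2, 3\}$ the row of `B` raises the rank of `A` from 2 to 3, so the influence is 1, while two of the three dimensions of the stacked matrix remain once the row of `B` is known, so the residual is 2.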

Theorem 1. Let $(X_1^n, Y_1^m)$ be a random vector in the space
$\mathcal{L}$, namely, there are i.i.d. nonsingular random variables $Z_1^k$
and two matrices $A$ and $B$ of dimension $n \times k$ and $m \times k$ such
that $X_1^n = A Z_1^k$ and $Y_1^m = B Z_1^k$. Let $Z_i = \Theta_i U_i + \bar{\Theta}_i V_i$ be
the representation for $Z_i$, $i \in [k]$. Then, we have

1) $d(X_1^n) = \mathbb{E}\{\mathrm{rank}(A_{C_\Theta})\}$,

2) $d(X_1^n | Y_1^m) = \mathbb{E}\{R(A; B)[C_\Theta]\}$,

where $C_\Theta = \{i \in [k] : \Theta_i = 1\}$ is the random set consisting
of the positions of the continuous components.

Remark 3. Notice that the results intuitively make sense,
namely, for a specific realization $\theta_1^k$, if $\theta_i = 0$ we can
neglect $Z_i$ because it is fully discrete and does not affect the
RID. Moreover, over the continuous components the resulting
contribution to the RID is equal to the rank of the matrix
$A_{C_\Theta}$, which is the effective dimension of the space over
which the continuous random variable $A_{C_\Theta} U_{C_\Theta}$ is distributed.
Finally, all of these contributions are averaged over all possible
realizations of $\Theta_1^k$.
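
The two expectations in Theorem 1 can be evaluated by brute force for small $k$, by enumerating every possible set of continuous positions and weighting it by its probability. The following Python sketch does this for the sum and the difference of two i.i.d. mixture variables; the helper names and the value $\delta = 3/10$ are assumptions made for the example.

```python
from fractions import Fraction
from itertools import combinations

def rank(rows):
    """Exact rank of a matrix given as a list of rows (Gaussian elimination over Q)."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    r = 0
    for c in range(n_cols):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][c] / m[r][c]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r

def restrict(matrix, S):
    """Keep the columns with 0-based index in S."""
    return [[row[j] for j in S] for row in matrix]

def weighted_subsets(k, delta):
    """Yield (S, P(C = S)) when each position is continuous independently w.p. delta."""
    for size in range(k + 1):
        for S in combinations(range(k), size):
            yield S, delta ** size * (1 - delta) ** (k - size)

def joint_rid(A, delta):
    """d(AZ) = E[rank(A_C)] (Theorem 1, item 1)."""
    return sum(p * rank(restrict(A, S)) for S, p in weighted_subsets(len(A[0]), delta))

def conditional_rid(A, B, delta):
    """d(AZ | BZ) = E[rank([A;B]_C) - rank(B_C)] (Theorem 1, item 2)."""
    return sum(p * (rank(restrict(A + B, S)) - rank(restrict(B, S)))
               for S, p in weighted_subsets(len(A[0]), delta))

delta = Fraction(3, 10)
row_sum, row_diff = [[1, 1]], [[1, -1]]
d_sum = joint_rid(row_sum, delta)                             # d(Z_1 + Z_2)
d_diff_given_sum = conditional_rid(row_diff, row_sum, delta)  # d(Z_1 - Z_2 | Z_1 + Z_2)
d_both = joint_rid(row_sum + row_diff, delta)                 # d of the full 2 x 2 transform
```

For this two-variable example the sum has RID $2\delta - \delta^2$ and the difference given the sum has RID $\delta^2$, which are the two one-step values of the erasure recursion used later for the polarization result.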



Using Theorem 1, it is possible to prove a list of properties
of the RID.

Theorem 2. Let $(X_1^n, Y_1^m)$ be a random vector in $\mathcal{L}$ as in
Theorem 1. Then, we have the following properties:

1) $d(X_1^n) = d(M X_1^n)$ for any arbitrary invertible matrix
$M$ of dimension $n \times n$.

2) $d(X_1^n, Y_1^m) = d(X_1^n) + d(Y_1^m | X_1^n)$.

3) $I_R(X_1^n; Y_1^m) = I_R(Y_1^m; X_1^n)$.

4) $I_R(X_1^n; Y_1^m) \ge 0$ and $I_R(X_1^n; Y_1^m) = 0$ if and only
if $X_1^n$ and $Y_1^m$ are independent after removing discrete
common parts, namely, those $Z_i$, $i \in [k]$ that are fully
discrete.

Further investigation also shows that there is a close
duality between the discrete entropy and the RID, as depicted
in Table I. As we will see in Subsections III-B and III-C, this
duality can be generalized to include some of the theorems
in classical information theory like the single terminal and multi
terminal (Slepian & Wolf) source coding problems.



Discrete random variables      | Random variables in $\mathcal{L}$
Discrete entropy $H$           | RID $d$
Conditional entropy            | Conditional RID
Mutual information             | Rényi mutual information
Deterministic                  | Discrete
Chain rule                     | Chain rule
Single terminal source coding  | Single terminal A2A compression
Multi terminal source coding   | Multi terminal A2A compression

TABLE I: Duality between $H$ and $d$

III. Main results 

In this section, we will give a brief overview of the
results proved in the paper. Subsection III-A is devoted to the
results obtained for the polarization of the Rényi information
dimension. These results are used in Subsections III-B and
III-C to study the A2A compression problem from an information
theoretic point of view. Subsection III-B considers the single
terminal case whereas Subsection III-C is devoted to the multi
terminal case.

A. Polarization of the Renyi information dimension 

Before stating the polarization result for the RID, we define 
the m-dimensional erasure process as follows. 

Definition 5. Let $\alpha \in [0, 1]$. An "erasure process" with initial
value $\alpha$ is defined as follows.

1) $e_0 = \alpha$, $e^+ = 2\alpha - \alpha^2$ and $e^- = \alpha^2$.

2) Let $e_n = e^{b_1 b_2 \cdots b_n}$ for some arbitrary $\{+, -\}$-valued
sequence $b_1^n$. Define

$$e^{b_1 b_2 \cdots b_n +} = 2 e_n - e_n^2, \qquad e^{b_1 b_2 \cdots b_n -} = e_n^2.$$

Remark 4. Notice that using the $\{+, -\}$ labeling, we can
construct a binary tree where each leaf of the tree is assigned
a specific $\{+, -\}$-valued sequence.



Let $\{B_n\}_{n=1}^{\infty}$ be a sequence of i.i.d. uniform $\{+, -\}$-valued
random variables. By replacing the $\{+, -\}$-labeling $b_1^n$ by $B_1^n$ in
the definition of the erasure process, we obtain a stochastic
process $e_n = e^{B_1 B_2 \cdots B_n}$. Let $\mathcal{F}_n$ be the $\sigma$-field generated
by $B_1^n$. Using the BEC polarization [19], [21], we have the
following results:

1) $(e_n, \mathcal{F}_n)$ is a positive bounded martingale.

2) $e_n$ converges to $e_\infty \in \{0, 1\}$ with $\mathbb{P}(e_\infty = 1) = \alpha$.

3) For any $0 < \beta < \frac{1}{2}$, $\liminf_{n \to \infty} \mathbb{P}(e_n \le 2^{-N^\beta}) = 1 - \alpha$,
where $N = 2^n$ is the number of all possible
cases that $e_n$ can take.
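
The erasure process is fully explicit, so its behavior can be checked numerically. The Python sketch below builds all $2^n$ leaf values of the binary tree of Remark 4 by applying the two maps $e \mapsto 2e - e^2$ and $e \mapsto e^2$; the function names, the tolerance 0.01 and the initial value 0.5 are assumptions chosen for the illustration.

```python
def erasure_level(alpha, n):
    """All 2**n values e^{b_1...b_n}; at every split the '+' child precedes the '-' child,
    so position i (0-based) carries the label given by the n-bit binary expansion of i
    with 0 read as '+' and 1 read as '-'."""
    values = [alpha]
    for _ in range(n):
        children = []
        for e in values:
            children.append(2 * e - e * e)  # '+' child
            children.append(e * e)          # '-' child
        values = children
    return values

def unpolarized_fraction(values, tol=0.01):
    """Fraction of leaves that are still strictly between tol and 1 - tol."""
    return sum(1 for e in values if tol < e < 1 - tol) / len(values)

alpha = 0.5
leaves = erasure_level(alpha, 16)
mean_value = sum(leaves) / len(leaves)          # stays at alpha (martingale property)
fraction_high = sum(1 for e in leaves if e >= 0.99) / len(leaves)
fraction_middle = unpolarized_fraction(leaves)
```

Because the two children of a node average to the node itself, the mean over the leaves stays at the initial value at every depth, which is the martingale property in item 1. The fraction of leaves that remain away from 0 and 1 shrinks as the depth grows, in line with item 2.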

Let $n \in \mathbb{N}$ and $N = 2^n$. Assume that $X_1^N$ is a sequence of
i.i.d. nonsingular random variables with a RID equal to $d(X)$
and let $Z_1^N = H_N X_1^N$, where $H_N$ is the Hadamard matrix
of order $N$. For $i \in [N]$, let us define $I_n(i) = d(Z_i | Z_1^{i-1})$.
Assume that $b_1^n$ is the binary expansion of $i - 1$. By replacing
0 by $+$ and 1 by $-$, we can equivalently represent $I_n(i)$ by a
sequence of $\{+, -\}$ values, namely, $I_n(i) = I^{b_1 b_2 \cdots b_n}$. Similar
to the erasure process, we can convert $I_n$ to a stochastic
process $I_n = I^{B_1 B_2 \cdots B_n}$ by using i.i.d. uniform $\{+, -\}$-valued
random variables $B_1^n$. We have the following theorem.

Theorem 3 (Single terminal RID polarization). $(I_n, \mathcal{F}_n)$ is an
erasure stochastic process with initial value $d(X)$ polarizing
to $\{0, 1\}$.
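
For a small block length, the statement of Theorem 3 can be checked directly against the rank formula of Theorem 1. The Python sketch below computes $I_n(i)$ for $N = 4$ as a difference of joint RIDs (chain rule of Theorem 2) and compares it with the erasure values. The Sylvester construction of the Hadamard matrix, the reading of the binary expansion with the most significant bit first, and all helper names are assumptions of this example rather than details confirmed by the text.

```python
from fractions import Fraction
from itertools import combinations

def rank(rows):
    """Exact rank of a matrix given as a list of rows (Gaussian elimination over Q)."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    r = 0
    for c in range(n_cols):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][c] / m[r][c]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r

def hadamard(n):
    """Sylvester Hadamard matrix of order 2**n with entries +1/-1 (assumed ordering)."""
    H = [[1]]
    for _ in range(n):
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    return H

def expected_rank(rows, N, delta):
    """E[rank(rows restricted to C)], each column in C independently w.p. delta."""
    total = 0
    for size in range(N + 1):
        for S in combinations(range(N), size):
            sub = [[row[j] for j in S] for row in rows]
            total += delta ** size * (1 - delta) ** (N - size) * rank(sub)
    return total

def rid_profile(n, delta):
    """I_n(i) = d(Z_1^i) - d(Z_1^{i-1}) for Z = H_N X, i = 1..N."""
    H = hadamard(n)
    N = len(H)
    joint = [0] + [expected_rank(H[:i], N, delta) for i in range(1, N + 1)]
    return [joint[i] - joint[i - 1] for i in range(1, N + 1)]

def erasure_level(alpha, n):
    """Erasure values in the order of the binary expansion of i - 1 (0 is '+', 1 is '-')."""
    values = [alpha]
    for _ in range(n):
        values = [v for e in values for v in (2 * e - e * e, e * e)]
    return values

delta = Fraction(3, 10)
profile = rid_profile(2, delta)    # conditional RIDs of the four outputs
pattern = erasure_level(delta, 2)  # erasure process values at depth 2
```

With exact rational arithmetic the two lists coincide for this example, and the four conditional RIDs add up to $4\delta$, the joint RID of the block.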

For $n \in \mathbb{N}$ and $N = 2^n$, let $\{(X_i, Y_i)\}_{i=1}^N$ be a sequence
of random vectors in the space $\mathcal{L}$, with joint and conditional
RID $d(X, Y)$, $d(X | Y)$ and $d(Y | X)$. Let $Z_1^N = H_N X_1^N$ and
assume that $W_1^N = H_N Y_1^N$. Let us define two processes $I_n$
and $J_n$ as follows.

$$I_n(i) = d(Z_i | Z_1^{i-1}), \quad i \in [N],$$

$$J_n(i) = d(W_i | W_1^{i-1}, Z_1^N), \quad i \in [N].$$

Similarly, we can label $I_n$ and $J_n$ by a sequence $b_1^n$
and convert them to stochastic processes $I_n = I^{B_1 B_2 \cdots B_n}$ and
$J_n = J^{B_1 B_2 \cdots B_n}$. By definition, we have the following
theorem.

Theorem 4 (Multi terminal RID polarization). $(I_n, \mathcal{F}_n)$ and
$(J_n, \mathcal{F}_n)$ are erasure stochastic processes with initial values
$d(X)$ and $d(Y | X)$, both polarizing to $\{0, 1\}$.

Remark 5. In the $t$ terminal case, $t > 2$, for a $t$ terminal
source $(X_1, X_2, \ldots, X_t)$, using a similar method it is
possible to construct erasure processes with initial values
$d(X_1), d(X_2 | X_1), \ldots, d(X_t | X_1^{t-1})$, polarizing to $\{0, 1\}$.

B. Single terminal A2A compression 

In this subsection, we will use the properties of the RID
developed in Section II to study the A2A compression of
memoryless sources. We assume that we have a memoryless
source with some given probability distribution. The idea is to
capture the information of the source, to be made clearer in
a moment, by taking some linear measurements. As is usual
in information theory, we are mostly interested in the asymptotic
regime for large block lengths. To do so, we will use an
ensemble of measurement matrices to analyze the asymptotic
behavior. We will also define the notion of REP (restricted
iso-entropy property) for an ensemble of measurement matrices.
This subsection is devoted to the single terminal case. The
results for the multi terminal case will be given in Subsection
III-C. We are mostly interested in the measurement rate
region of the problem in order to successfully capture the
source.

Definition 6. Let $X_1^N$ be a sequence of i.i.d. random variables
with a probability distribution $p_X$ (discrete, mixture or
continuous) over $\mathbb{R}$, and let $D_1^N = [X_1^N]_q$ for $q \in \mathbb{N}$. The family
of measurement matrices $\{\Phi_N\}$, indexed with a subsequence
of $\mathbb{N}$ and with dimension $m_N \times N$, is $\epsilon$-REP($p_X$) with the
measurement rate $\rho$ if

$$\limsup_{q \to \infty} \frac{H(D_1^N | \Phi_N X_1^N)}{H(D_1^N)} \le \epsilon, \qquad (1)$$

$$\limsup_{N \to \infty} \frac{m_N}{N} = \rho.$$



To give some intuitive justification for the REP definition,
let us assume that all of the measurements are captured with
a device with finite precision $1/q_0$ for some $q_0 \in \mathbb{N}$. In that
case, although the potential information of the signal, in terms
of bits, can be very large, what we effectively observe
through the finite precision device is only $H([X_1^N]_{q_0})$. In such
a setting, the ratio of the information we lose after taking the
measurements, assuming that some genie gives us the infinite
precision measurement captured from the signal, is exactly
what we have in the definition of REP, namely,

$$\frac{H(D_1^N | \Phi_N X_1^N)}{H(D_1^N)}, \qquad (2)$$

where we assume that $D_1^N = [X_1^N]_{q_0}$. This might be a
reasonable model for applications because this is essentially
what happens in practice. The problem with this model is
that it is not invariant under some obvious transformations
like scaling. For example, assume that we are scaling the
signal by some real number. In this case, through some simple
examples it is possible to show that the ratio in (2) can change
considerably. There are two approaches to cope with this
problem. One is to scale the signal with a desired factor to
match it to the finite precision quantizer, which in its own right
can be very interesting to analyze but will probably be too
complicated. The other is to take our approach and
develop a theory for the case in which the resolution is high
enough so that the quality measure proposed in (2) is not
affected by the shape of the distribution of the signal.

Remark 6. Notice that in the fully discrete case, the REP
definition is simplified to the equivalent form

$$\frac{H(X_1^N | \Phi_N X_1^N)}{H(X_1^N)} \le \epsilon,$$

$$\limsup_{N \to \infty} \frac{m_N}{N} \le \rho.$$

Remark 7. For a non-discrete source with strictly positive RID,
$d(X) > 0$, if we divide the numerator and the denominator
in the expression (1) by $\log_2(q)$, take the limit as $q$ tends to
infinity and use the definition of the RID, we get the equivalent
form

$$\frac{d(X_1^N | \Phi_N X_1^N)}{d(X_1^N)} \le \epsilon.$$

Interestingly, this implies that in the high resolution regime
that we are considering for analysis, the information isometry
(keeping more than a $1 - \epsilon$ ratio of the information of the
signal) is equivalent to the Rényi isometry. Moreover, from
the properties of the RID, it is easy to see that this REP
measure meets some of the invariance requirements that we
expect. For example, it is scale invariant and any invertible
linear transformation of the input signal $X_1^N$ keeps the $\epsilon$-REP
measure unchanged.
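
In the high resolution form of Remark 7, the REP loss is a ratio of RIDs, and by Theorem 1 (taking the identity matrix for the signal itself) the conditional RID of an i.i.d. block given its measurements is the expected value of $|C_\Theta| - \mathrm{rank}(\Phi_{C_\Theta})$. The Python sketch below evaluates this ratio exactly for a small partial Hadamard matrix. The row selection rule (keep the rows with the largest erasure values), the Sylvester ordering of the Hadamard matrix, the block length $N = 8$ and all function names are assumptions made for this illustration.

```python
from fractions import Fraction
from itertools import combinations

def rank(rows):
    """Exact rank of a matrix given as a list of rows (Gaussian elimination over Q)."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    r = 0
    for c in range(n_cols):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][c] / m[r][c]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r

def hadamard(n):
    """Sylvester Hadamard matrix of order 2**n with entries +1/-1 (assumed ordering)."""
    H = [[1]]
    for _ in range(n):
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    return H

def erasure_level(alpha, n):
    """Erasure values at depth n, '+' child before '-' child at every split."""
    values = [alpha]
    for _ in range(n):
        values = [v for e in values for v in (2 * e - e * e, e * e)]
    return values

def rep_loss(rows, N, delta):
    """d(X | Phi X) / d(X) = (N*delta - E[rank(Phi_C)]) / (N*delta), for delta > 0."""
    expected = 0
    for size in range(N + 1):
        for S in combinations(range(N), size):
            sub = [[row[j] for j in S] for row in rows]
            expected += delta ** size * (1 - delta) ** (N - size) * rank(sub)
    return (N * delta - expected) / (N * delta)

def partial_hadamard(n, alpha, m):
    """Keep the m rows of H_N with the largest erasure values (hypothetical rule)."""
    H = hadamard(n)
    e = erasure_level(alpha, n)
    keep = sorted(sorted(range(len(H)), key=lambda i: e[i], reverse=True)[:m])
    return [H[i] for i in keep]

n, delta = 3, Fraction(1, 2)
N = 2 ** n
Phi = partial_hadamard(n, delta, m=5)
loss = rep_loss(Phi, N, delta)  # fraction of the RID not captured by Phi
```

Keeping all rows gives a loss of 0 because the Hadamard matrix is invertible, and keeping none gives a loss of 1. For any selection, the expected rank cannot exceed the number of rows, so $m/N \ge d(X)(1 - \text{loss})$ holds already at finite block length.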

We can also extend the definition when the probability
distribution of the source is not known exactly but it is known
to belong to a given collection of distributions $\Pi$.

Definition 7. Assume $\Pi = \{\pi : \pi \in \Pi\}$ is a class of
nonsingular probability distributions over $\mathbb{R}$. The family of
measurement matrices $\{\Phi_N\}$, indexed with a subsequence of
$\mathbb{N}$ and with dimension $m_N \times N$, is $\epsilon$-REP($\Pi$) for measurement
rate $\rho$ if it is $\epsilon$-REP($\pi$) for every $\pi \in \Pi$.

Now that we have the required tools and definitions, we 
give a characterization of the required measurement rate in 
order to keep the information isometry. Similar to all theorems 
in information theory, we do this using the "converse" and 
"achievability" parts. 

Theorem 5 (Converse result). Let $X_1^N$ be a sequence of
i.i.d. random variables in $\mathcal{L}$. Suppose $\{\Phi_N\}$ is a family of
$\epsilon$-REP($p_X$) measurement matrices of dimension $m_N \times N$. Then
$\rho \ge d(X_1)(1 - \epsilon)$.

Remark 8. This result implies that to capture the information
of the signal, the asymptotic measurement rate must be
approximately greater than the RID of the source. This in some
sense is similar to the single terminal source coding problem,
in which the encoding rate must be greater than the entropy of
the source. This again emphasizes the analogy between $H$
and $d$. Moreover, in the discrete case, $d(X) = 0$, the result is
trivial.

Remark 9. It was proved in [12] that under linear encoding and
a block error probability distortion condition, the measurement
rate must be at least the RID of the source, $\rho \ge d(X)$.
Theorem 5 strengthens this result, stating that $\rho \ge d(X)$
must hold even under the milder $\epsilon$-REP restriction on the
measurement ensemble.

Theorem 5 puts a lower bound on the measurement rate in
order to keep the $\epsilon$-REP property. However, it might happen
that there is no measurement family achieving this bound.
Fortunately, as we will see, it is possible to deterministically
truncate the family of Hadamard matrices to obtain a measurement
family with the $\epsilon$-REP property and measurement rate $d(X)$.
This is summarized in the following two theorems. Notice that
in the fully continuous case, as Theorem 5 implies, the feasible
measurement rate is approximately 1, which for example can
be achieved with any complete orthonormal family; thus no
explicit construction is necessary. For the noncontinuous case,
we will distinguish between the fully discrete case and the
mixture case because they need different proof techniques.
Theorems 6 and 7 summarize the results.

Theorem 6 (Achievability result). Let $X_1^N$ be a sequence
of i.i.d. discrete integer-valued$^2$ random variables. Then, for
any $\epsilon > 0$, there is a family of $\epsilon$-REP($p_X$) partial Hadamard
matrices of dimension $m_N \times N$, for $N = 2^n$, with $\rho = 0$.

Theorem 7 (Achievability result). Let $X_1^N$ be a sequence of
i.i.d. random variables in $\mathcal{L}$. Then, for any $\epsilon > 0$, there is a
family of $\epsilon$-REP($p_X$) partial Hadamard matrices of dimension
$m_N \times N$, for $N = 2^n$, with $\rho = d(X_1)$.

We also have the general result in Theorem 8, which implies
that we can construct a family of truncated Hadamard matrices
which is $\epsilon$-REP for a class of distributions.

Theorem 8 (Achievability result). Let $\Pi$ be a family of
probability distributions with strictly positive RID. Then, for
any $\epsilon > 0$, there is a family of $\epsilon$-REP($\Pi$) partial Hadamard
matrices of dimension $m_N \times N$, for $N = 2^n$, with $\rho =$

Remark 10. Theorem 8 implies that there is a fixed ensemble
of measurement matrices capable of capturing the information
of all of the distributions in the family $\Pi$. This is very
useful in applications because taking the measurements is usually
costly and most of the time we do not have the exact
distribution of the signal. If each distribution needed its own
specific measurement matrix, we would have to do several rounds of
measurement, each time taking the measurements compatible
with one specific distribution and doing the recovery process
for that specific distribution. The benefit of Theorem 8 is that
one measurement ensemble works for all of the distributions. It is
also worth noticing that although the measurement ensemble
is fixed, the recovery (decoding) process might need to know
the exact distribution of the signal in order to have successful
recovery.

C. Multi terminal A2A compression 

In this section, our goal is to extend the A2A compression
theory from the single terminal case to the multi terminal case.
In the multi terminal setting, we have a memoryless source
which is distributed over more than one terminal and we are
going to take linear measurements from different terminals in
order to capture the information of the source. We are again
interested in an asymptotic regime for large block lengths. To
do so, we will use an ensemble of distributed measurement
matrices that we will introduce in a moment. Similar to the
single terminal case, we are interested in the measurement rate
region of the problem, namely, the number of measurements
that we need from different terminals in order to capture
the signal faithfully. We will analyze the problem for the two
terminal case. The extension to more than two terminals is
straightforward.

$^2$ We proved this theorem using the EPI result we developed in [16], where
we proved the result for lattice discrete random variables. However, we believe
that such a result is also true for non-lattice discrete distributions.

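Definition 8. Let $\{(X_i, Y_i)\}_{i=1}^N$ be a two terminal memoryless
source with $(X_1, Y_1)$ being in $\mathcal{L}$. The family of distributed
measurement matrices $\{\Phi_N^x, \Phi_N^y\}$, indexed with a subsequence
of $\mathbb{N}$, is $\epsilon$-REP($p_{X,Y}$) for the measurement rate $(\rho_x, \rho_y)$ if

$$\limsup_{q \to \infty} \frac{H([X_1^N]_q, [Y_1^N]_q | \Phi_N^x X_1^N, \Phi_N^y Y_1^N)}{H([X_1^N]_q, [Y_1^N]_q)} \le \epsilon, \qquad (3)$$

$$\limsup_{N \to \infty} \frac{m_N^x}{N} \le \rho_x, \qquad \limsup_{N \to \infty} \frac{m_N^y}{N} \le \rho_y.$$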

Remark 11. If $(X, Y)$ is a random vector in $\mathcal{L}$ with $d(X, Y) > 0$,
similar to what we did in the single terminal case, dividing
the numerator and the denominator in the expression (3) by
$\log_2(q)$ and taking the limit as $q$ tends to infinity, we get the
equivalent definition

$$\frac{d(X_1^N, Y_1^N | \Phi_N^x X_1^N, \Phi_N^y Y_1^N)}{d(X_1^N, Y_1^N)} \le \epsilon,$$

which implies the equivalence of the information isometry and
the Rényi isometry.

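Remark 12. Notice that in the fully discrete case, the definition
above is simplified to the equivalent form

$$\frac{H(X_1^N, Y_1^N | \Phi_N^x X_1^N, \Phi_N^y Y_1^N)}{H(X_1^N, Y_1^N)} \le \epsilon,$$

$$\limsup_{N \to \infty} \frac{m_N^x}{N} \le \rho_x, \qquad \limsup_{N \to \infty} \frac{m_N^y}{N} \le \rho_y.$$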



We can also extend the definition to a class of probability
distributions.

Definition 9. Assume that $\Pi = \{\pi : \pi \in \Pi\}$ is a class of
nonsingular probability distributions in $\mathcal{L}$. The family of
measurement matrices $\{\Phi_N^x, \Phi_N^y\}$ is $\epsilon$-REP($\Pi$) for measurement
rate $(\rho_x, \rho_y)$ if it is $\epsilon$-REP($\pi$) for every $\pi \in \Pi$.

Definition 10. Let $(X, Y)$ be a two dimensional random vector in $\mathcal{L}$ with a distribution $p_{X,Y}$. The Renyi information region of $p_{X,Y}$ is the set of all $(\rho_x, \rho_y) \in [0, 1]^2$ satisfying

$$\rho_x \ge d(X|Y), \quad \rho_y \ge d(Y|X), \quad \rho_x + \rho_y \ge d(X, Y).$$
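The region of Definition 10 can be checked numerically for a candidate rate pair. The sketch below is only an illustration: the function name `in_renyi_region`, the tolerance, and the numerical RID values are hypothetical and not taken from the paper, and the inequalities are read as non-strict.

```python
def in_renyi_region(rho_x, rho_y, d_x_given_y, d_y_given_x, d_xy, tol=1e-12):
    """Return True if (rho_x, rho_y) satisfies the three constraints of
    Definition 10 (read as non-strict) and lies in [0, 1]^2."""
    if not (0.0 <= rho_x <= 1.0 and 0.0 <= rho_y <= 1.0):
        return False
    return (rho_x >= d_x_given_y - tol
            and rho_y >= d_y_given_x - tol
            and rho_x + rho_y >= d_xy - tol)


# Hypothetical RID values, chosen only for illustration.
d_x, d_y_given_x = 0.5, 0.2
d_xy = d_x + d_y_given_x          # chain rule: d(X, Y) = d(X) + d(Y|X)
d_y = 0.4
d_x_given_y = d_xy - d_y          # chain rule in the other order

corner_a = (d_x, d_y_given_x)     # corner point (d(X), d(Y|X))
corner_b = (d_x_given_y, d_y)     # corner point (d(X|Y), d(Y))
midpoint = ((corner_a[0] + corner_b[0]) / 2, (corner_a[1] + corner_b[1]) / 2)
```

Both corner points and every convex combination of them sit on the face $\rho_x + \rho_y = d(X, Y)$, which is why the midpoint passes the check as well.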

Definition 11. Assume that $\Pi$ is a class of two dimensional random vectors from $\mathcal{L}$. The Renyi information region of the class $\Pi$ is the intersection of the Renyi information regions of the distributions in $\Pi$.

Similar to the single terminal case, we are interested in the 
rate region of the problem. We have the following converse 
and achievability results. 

Theorem 9 (Converse result). Let $\{(X_i, Y_i)\}_{i=1}^N$ be a two-terminal memoryless source with $(X_1, Y_1)$ being in $\mathcal{L}$. Assume that the distributed family of measurement matrices $\{\Phi^x_N, \Phi^y_N\}$ is ε-REP with a measurement rate $(\rho_x, \rho_y)$. Then,

$$\rho_x + \rho_y \ge d(X, Y)(1 - \epsilon),$$

$$\rho_x \ge d(X|Y) - \epsilon\, d(X, Y), \qquad \rho_y \ge d(Y|X) - \epsilon\, d(X, Y).$$

Remark 13. This rate region is very similar to the rate region 
of the distributed source coding (Slepian & Wolf) problem with 
the only difference that the discrete entropy has been replaced 
by the RID, which again emphasizes the analogy between the 
discrete entropy and the RID. Similar to the Slepian & Wolf problem, we call $\rho_x + \rho_y = d(X, Y)$ the dominant face of the measurement rate region.

Theorem 10 (Achievability result). Let $\{(X_i, Y_i)\}_{i=1}^N$ be a discrete two-terminal memoryless source. Then there is a family of ε-REP partial Hadamard matrices with measurement rate $(\rho_x, \rho_y) = (0, 0)$.

Theorem 11 (Achievability result). Let $\{(X_i, Y_i)\}_{i=1}^N$ be a two-terminal memoryless source with $(X_1, Y_1)$ belonging to $\mathcal{L}$. Given any $(\rho_x, \rho_y)$ satisfying

$$\rho_x + \rho_y \ge d(X_1, Y_1), \quad \rho_x \ge d(X_1|Y_1), \quad \rho_y \ge d(Y_1|X_1),$$

there is a family of ε-REP partial Hadamard matrices with measurement rate $(\rho_x, \rho_y)$.



We also have the general result in Theorem 12, which implies that we can construct a family of truncated Hadamard matrices which is ε-REP for a class of distributions.

Theorem 12 (Achievability result). Let $\Pi$ be a family of two dimensional probability distributions in $\mathcal{L}$. Then, for any $(\rho_x, \rho_y)$ in the measurement region of $\Pi$, there is a family of partial Hadamard matrices which is ε-REP($\Pi$) with a measurement rate $(\rho_x, \rho_y)$.

IV. Proof techniques 

In this section, we will give a brief overview of the techniques used to prove the results. We will divide this section into four subsections. In Subsection IV-A, we will overview the proof techniques for the RID. Subsection IV-B proves the polarization of the RID. Subsections IV-C and IV-D will be devoted to proof ideas and intuitions about the A2A compression problem in the single and multi terminal case.

A. Renyi information dimension 

In this section, we will prove Theorems 1 and 2 and we will give further intuitions about the RID over the space $\mathcal{L}$.

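Proof of Theorem 1: Throughout, $\doteq$ denotes equality up to a term that is negligible compared with $\log_2(q)$ as $q$ tends to infinity. To prove the first part of the theorem, notice that

$$H([X_1^n]_q) \doteq H([X_1^n]_q, \Theta_1^k) \doteq H([X_1^n]_q \mid \Theta_1^k),$$

because $H(\Theta_1^k) \le k \doteq 0$. As $\Theta_1^k \in \{0, 1\}^k$ takes finitely many values, it is sufficient to show that for any realization $\theta_1^k$,

$$\lim_{q \to \infty} \frac{H([X_1^n]_q \mid \theta_1^k)}{\log_2(q)} = \mathrm{rank}(A_{C_\theta}). \quad (4)$$

Taking the expectation over $\Theta_1^k$, we will get the result. To prove (4), notice that

$$H([X_1^n]_q \mid \theta_1^k) = H([A_{C_\theta} U_{C_\theta} + A_{\bar{C}_\theta} V_{\bar{C}_\theta}]_q)$$
$$\doteq H([A_{C_\theta} U_{C_\theta} + A_{\bar{C}_\theta} V_{\bar{C}_\theta}]_q \mid V_{\bar{C}_\theta}) \quad (5)$$
$$\doteq H([A_{C_\theta} U_{C_\theta}]_q), \quad (6)$$

where we used $H(V_{\bar{C}_\theta}) \le \sum_{i \in \bar{C}_\theta} H(V_i) \doteq 0$. We also used the fact that, knowing $V_{\bar{C}_\theta}$, $[A_{C_\theta} U_{C_\theta}]_q$ and $[A_{C_\theta} U_{C_\theta} + A_{\bar{C}_\theta} V_{\bar{C}_\theta}]_q$ are equal up to finite uncertainty. Specifically, suppose $L$ is the minimum number of lattices of size $1/q$ required to cover $A_{C_\theta} \times [0, 1/q]^{|C_\theta|}$, which is a finite number. Then

$$H([A_{C_\theta} U_{C_\theta}]_q \mid V_{\bar{C}_\theta}, [A_{C_\theta} U_{C_\theta} + A_{\bar{C}_\theta} V_{\bar{C}_\theta}]_q) \le \log_2(L),$$

which implies (5) and (6).

Generally, $A_{C_\theta}$ is not full rank. Assume that the rank of $A_{C_\theta}$ is equal to $m$ and let $A_m$ be a subset of $m$ linearly independent rows. It is not difficult to see that knowing $[A_m U_{C_\theta}]_q$ there is only finite uncertainty in the remaining components of $[A_{C_\theta} U_{C_\theta}]_q$, which is negligible compared with $\log_2(q)$ as $q$ tends to infinity. Therefore, we obtain

$$H([X_1^n]_q \mid \theta_1^k) \doteq H([A_{C_\theta} U_{C_\theta}]_q) \doteq H([A_m U_{C_\theta}]_q) \doteq m \log_2(q).$$

Thus, taking the limit as $q$ tends to infinity, we obtain

$$\lim_{q \to \infty} \frac{H([X_1^n]_q \mid \theta_1^k)}{\log_2(q)} = m = \mathrm{rank}(A_{C_\theta}).$$

Also, taking the expectation with respect to $\Theta_1^k$, we obtain $d(X_1^n) = E\{\mathrm{rank}(A_{C_\Theta})\}$, which is the desired result.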
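To prove the second part of the theorem, notice that

$$H([X_1^n]_q \mid Y_1^m) \doteq H([X_1^n]_q \mid Y_1^m, \Theta_1^k).$$

For a specific realization $\theta_1^k$ we have

$$H([X_1^n]_q \mid Y_1^m, \theta_1^k) = H([A_{C_\theta} U_{C_\theta} + A_{\bar{C}_\theta} V_{\bar{C}_\theta}]_q \mid B_{C_\theta} U_{C_\theta} + B_{\bar{C}_\theta} V_{\bar{C}_\theta})$$
$$\doteq H([A_{C_\theta} U_{C_\theta} + A_{\bar{C}_\theta} V_{\bar{C}_\theta}]_q \mid B_{C_\theta} U_{C_\theta} + B_{\bar{C}_\theta} V_{\bar{C}_\theta}, V_{\bar{C}_\theta})$$
$$\doteq H([A_{C_\theta} U_{C_\theta}]_q \mid B_{C_\theta} U_{C_\theta}).$$

Generally, $A_{C_\theta}$ is not full rank. Let $A_m$ be a maximal set of linearly independent rows of $A_{C_\theta}$, of size $m$. Then

$$H([A_{C_\theta} U_{C_\theta}]_q \mid B_{C_\theta} U_{C_\theta}) \doteq H([A_m U_{C_\theta}]_q \mid B_{C_\theta} U_{C_\theta}).$$

It may happen that some of the rows of $A_m$ can be written as a linear combination of rows of $B_{C_\theta}$. Let $A_r$ be the remaining matrix after dropping the $m - r$ predictable rows of $A_m$. Given $B_{C_\theta} U_{C_\theta}$, $A_r U_{C_\theta}$ has a continuous distribution, thus

$$H([A_r U_{C_\theta}]_q \mid B_{C_\theta} U_{C_\theta}) \doteq r \log_2(q).$$

It is easy to check that $r$ is exactly $R(A; B)[C_\theta]$. Therefore, taking the expectation with respect to $\Theta_1^k$, we get

$$d(X_1^n \mid Y_1^m) = E\{R(A; B)[C_\Theta]\}.$$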
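The rank characterization can be evaluated by brute force for small $k$. The sketch below is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, `deltas[i]` stands for $P(\Theta_i = 1)$, and the conditional rank increase is computed as $R(A;B)[C] = \mathrm{rank}([A;B]_C) - \mathrm{rank}(B_C)$, which is the identity used in the proof of the chain rule.

```python
import itertools
import numpy as np


def _rank(M):
    # Rank of a possibly empty sub-matrix (no selected columns -> rank 0).
    return 0 if M.size == 0 else int(np.linalg.matrix_rank(M))


def rid(A, deltas):
    """E{rank(A_C)} where column j is kept independently with prob deltas[j]."""
    A = np.atleast_2d(np.asarray(A, dtype=float))
    total = 0.0
    for theta in itertools.product([0, 1], repeat=A.shape[1]):
        prob = np.prod([d if t else 1.0 - d for t, d in zip(theta, deltas)])
        cols = [j for j, t in enumerate(theta) if t]
        total += prob * _rank(A[:, cols])
    return float(total)


def conditional_rid(A, B, deltas):
    """E{R(A;B)[C]} with R(A;B)[C] = rank([A;B]_C) - rank(B_C)."""
    A = np.atleast_2d(np.asarray(A, dtype=float))
    B = np.atleast_2d(np.asarray(B, dtype=float))
    stacked = np.vstack([A, B])
    return rid(stacked, deltas) - rid(B, deltas)
```

For example, with $X = Z_1 + Z_2$ and $Y = Z_1$, where both $Z_i$ have RID 0.5, `rid([[1, 1]], [0.5, 0.5])` evaluates the probability that at least one $Z_i$ is in its continuous mode, and `conditional_rid([[1, 1]], [[1, 0]], [0.5, 0.5])` evaluates the probability that $Z_2$ is.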



We also get the following corollary, which shows the 
additive property of the RID for the independent random 
variables from C. 

Corollary 1. Let $X_1^N$ be independent random variables from $\mathcal{L}$. Then $d(X_1^N) = \sum_{i=1}^N d(X_i)$.

Proof: Notice that we can simply write $X_1^N = I_N X_1^N$, where $I_N$ is the identity matrix of order $N$. Therefore, by the rank characterization for the RID, we have

$$d(X_1^N) = E\{\mathrm{rank}(I_N[C_\Theta])\} = E\Big\{\sum_{i=1}^N \Theta_i\Big\} = \sum_{i=1}^N d(X_i),$$

where we used the fact that the columns of $I_N$ are linearly independent, thus adding a column increases the rank by 1. Therefore, the rank of $I_N[C_\Theta]$ is equal to the number of 1's in $\Theta_1^N$, namely, $\sum_{i=1}^N \Theta_i$. ■

Using the results of Theorem 1, we can prove Theorem 2.
Proof of Theorem 2: For part 1, the proof is simple by considering the rank characterization. We know that $X_1^n = A Z_1^k$ and $d(X_1^n) = E\{\mathrm{rank}(A_{C_\Theta})\}$. Moreover, $M X_1^n = M A Z_1^k$, thus $d(M X_1^n) = E\{\mathrm{rank}(M A_{C_\Theta})\}$. As $M$ is invertible, $\mathrm{rank}(A_{C_\Theta}) = \mathrm{rank}(M A_{C_\Theta})$, thus we get the result.

For part 2, notice that for any realization $\theta_1^k$ and the corresponding set $C_\theta$,

$$\mathrm{rank}([A; B]_{C_\theta}) = \mathrm{rank}(A_{C_\theta}) + R(B; A)[C_\theta] = \mathrm{rank}(B_{C_\theta}) + R(A; B)[C_\theta].$$

Taking the expectation over $\Theta_1^k$, we get the desired result

$$d(X_1^n, Y_1^m) = d(X_1^n) + d(Y_1^m \mid X_1^n) = d(Y_1^m) + d(X_1^n \mid Y_1^m).$$

For part 3, using the chain rule result from part 2 and applying the definition of $I_R(X_1^n; Y_1^m)$, we get

$$I_R(X_1^n; Y_1^m) = d(X_1^n) + d(Y_1^m) - d(X_1^n, Y_1^m),$$

which shows the symmetry of $I_R$ with respect to $X_1^n$ and $Y_1^m$.

For part 4, notice that for a specific realization $\theta_1^k$, a simple rank check shows that $R(A; B)[C_\theta] \le \mathrm{rank}(A_{C_\theta})$. Taking the expectation over $\Theta_1^k$, we get $d(X_1^n \mid Y_1^m) \le d(X_1^n)$.

If $X_1^n$ and $Y_1^m$ are independent, the equality follows from the definition. For the converse part, notice that if $X_1^n$ is fully discrete then $d(X_1^n \mid Y_1^m) \le d(X_1^n) = 0$. Similarly, if $Y_1^m$ is fully discrete then $d(Y_1^m \mid X_1^n) \le d(Y_1^m) = 0$ and using the identity $d(X_1^n) - d(X_1^n \mid Y_1^m) = d(Y_1^m) - d(Y_1^m \mid X_1^n)$, we get the equality. This case is fine because after removing the discrete $Z_i$, $i \in [k]$, either $X_1^n$ or $Y_1^m$ is equal to 0, namely, a deterministic value, and the independence holds.

Assume that neither $X_1^n$ nor $Y_1^m$ is fully discrete. Without loss of generality, let $Z_1^r$ be the non-discrete random variables among $Z_1^k$ and let $\tilde{X}_1^n$ and $\tilde{Y}_1^m$ be the resulting random vectors after dropping the discrete constituents, namely, we have $\tilde{X}_1^n = A_r Z_1^r$ and $\tilde{Y}_1^m = B_r Z_1^r$, where $A_r$ and $B_r$ are the matrices consisting of the first $r$ columns of $A$ and $B$ respectively. It is easy to check that $d(\tilde{X}_1^n) = d(X_1^n)$ and $d(\tilde{X}_1^n \mid \tilde{Y}_1^m) = d(X_1^n \mid Y_1^m)$. Thus it remains to show that $\tilde{X}_1^n$ and $\tilde{Y}_1^m$ are independent. As we have dropped all of the discrete components, the resulting $\Theta_i$, $i \in [r]$, are 1 with strictly positive probability. This implies that for any realization of $\Theta_1^r$ and the corresponding $C_\theta$, $R(A_r; B_r)[C_\theta] = \mathrm{rank}((A_r)_{C_\theta})$. In particular, this holds for any $C_\theta$ of size 1, namely, for any column of $A_r$ and $B_r$, which implies that if $A_r$ has a non-zero column the corresponding column in $B_r$ must be zero and if $B_r$ has a non-zero column then the corresponding column in $A_r$ must be zero. This implies that $\tilde{X}_1^n$ and $\tilde{Y}_1^m$ depend on disjoint subsets of the random variables $Z_1^r$. Therefore, they must be independent.

B. Polarization of the RID 

In this section, we will prove the polarization of the RID in the single and multi terminal case as stated in Theorem 3 and Theorem 4. The main idea is to use the recursive structure of the Hadamard matrices and the rank characterization of the RID in the space $\mathcal{L}$.

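Proof of Theorem 3: For the initial value, we have $I_0(1) = d(X_1)$. Let $n \in \mathbb{N}$ and $N = 2^n$. To simplify the proof, instead of the Hadamard matrices $H_N$, we will use the shuffled Hadamard matrices $\tilde{H}_N$, constructed as follows: $\tilde{H}_1 = H_1$ and $\tilde{H}_{2N}$ is constructed from $\tilde{H}_N$ as follows

$$\tilde{H}_N = \begin{pmatrix} \tilde{h}_1 \\ \vdots \\ \tilde{h}_N \end{pmatrix} \;\longrightarrow\; \tilde{H}_{2N} = \begin{pmatrix} \tilde{h}_1 & \tilde{h}_1 \\ \tilde{h}_1 & -\tilde{h}_1 \\ \vdots & \vdots \\ \tilde{h}_N & \tilde{h}_N \\ \tilde{h}_N & -\tilde{h}_N \end{pmatrix},$$

where $\tilde{h}_i$, $i \in [N]$, denotes the $i$-th row of $\tilde{H}_N$. Let $X_1^N$ be as in Theorem 3 and let $\tilde{Z}_1^N = \tilde{H}_N X_1^N$, where $H_N$ is replaced by $\tilde{H}_N$. Also, let $\tilde{I}_n(i) = d(\tilde{Z}_i \mid \tilde{Z}_1^{i-1})$, $i \in [N]$. We first prove that $\tilde{I}$ is also an erasure process with initial value $d(X_1)$ and evolves as follows

$$\tilde{I}_n(i)^+ = \tilde{I}_{n+1}(2i - 1) = 2\tilde{I}_n(i) - \tilde{I}_n(i)^2,$$
$$\tilde{I}_n(i)^- = \tilde{I}_{n+1}(2i) = \tilde{I}_n(i)^2,$$

where $i \in [N]$ with the corresponding $\{+, -\}$-labeling $b_1^n$.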
Also, let $\tilde{H}^{i-1}$ and $\tilde{H}^{i}$ denote the first $i - 1$ and the first $i$ rows of $\tilde{H}_N$, and let $\tilde{h}_i$ denote the $i$-th row of $\tilde{H}_N$. Thus, we have $\tilde{Z}_1^{i} = \tilde{H}^{i} X_1^N$ and $\tilde{Z}_1^{i-1} = \tilde{H}^{i-1} X_1^N$. As $X_1^N$ are i.i.d. nonsingular random variables, it results that $\tilde{Z}_1^{i}$ belongs to the space $\mathcal{L}$ generated by the $X_1^N$ random variables. Notice that using the rank characterization for the RID over $\mathcal{L}$, we have

$$d(\tilde{Z}_i \mid \tilde{Z}_1^{i-1}) = E\{I(\tilde{H}^{i-1}; \tilde{h}_i)[C_\Theta]\},$$

where $I(\tilde{H}^{i-1}; \tilde{h}_i)[C_\Theta] \in \{0, 1\}$ is the amount of increase of the rank of $\tilde{H}^{i-1}_{C_\Theta}$ by adding $\tilde{h}_i$. Now, consider the stage $n + 1$, where we have the shuffled Hadamard matrix $\tilde{H}_{2N}$. Consider the row $i^+$, which corresponds to the row $2i - 1$ of $\tilde{H}_{2N}$. Now, if we look at the first block of the new matrix, we simply notice that adding $\tilde{h}_i$ has the same effect in increasing the rank of this block as it had in $\tilde{H}_N$. A similar argument holds for the second block. Moreover, adding $\tilde{h}_i$ increases the rank of the matrix if it increases the rank of either the first or the second block or both. Let $1_i(\Theta_1^N) \in \{0, 1\}$ denote the random rank increase in $\tilde{H}^{i-1}$ by adding $\tilde{h}_i$, then we have

$$1_{2i-1}(\Theta_1^{2N}) = 1_i(\Theta_1^N) + 1_i(\Theta_{N+1}^{2N}) - 1_i(\Theta_1^N)\, 1_i(\Theta_{N+1}^{2N}).$$

$\Theta_1^N$ and $\Theta_{N+1}^{2N}$ are i.i.d. random variables and a simple check shows that $1_i(\Theta_1^N)$ and $1_i(\Theta_{N+1}^{2N})$ are also i.i.d. Taking the expectation, we obtain

$$\tilde{I}_n(i)^+ = 2\tilde{I}_n(i) - \tilde{I}_n(i)^2. \quad (7)$$

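Moreover, if we denote $\tilde{W}_1^N = \tilde{H}_N X_{N+1}^{2N}$, then by the structure of $\tilde{H}_{2N}$ it is easy to see that $\tilde{I}_n(i)^+$ and $\tilde{I}_n(i)^-$ can be written as follows:

$$\tilde{I}_n(i)^+ = d(\tilde{Z}_i + \tilde{W}_i \mid \tilde{Z}_1^{i-1}, \tilde{W}_1^{i-1}),$$
$$\tilde{I}_n(i)^- = d(\tilde{Z}_i - \tilde{W}_i \mid \tilde{Z}_i + \tilde{W}_i, \tilde{Z}_1^{i-1}, \tilde{W}_1^{i-1}).$$

Using the chain rule for the RID, we have

$$\frac{\tilde{I}_n(i)^+ + \tilde{I}_n(i)^-}{2} = \frac{1}{2}\, d(\tilde{Z}_i + \tilde{W}_i, \tilde{Z}_i - \tilde{W}_i \mid \tilde{Z}_1^{i-1}, \tilde{W}_1^{i-1})$$
$$= \frac{1}{2}\, d(\tilde{Z}_i, \tilde{W}_i \mid \tilde{Z}_1^{i-1}, \tilde{W}_1^{i-1})$$
$$= d(\tilde{Z}_i \mid \tilde{Z}_1^{i-1}) = \tilde{I}_n(i),$$

which along with (7), implies that $\tilde{I}_n(i)^- = \tilde{I}_n(i)^2$. Therefore, $\tilde{I}$ evolves like an erasure process with initial value $d(X)$.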

Now, notice that the only difference between $H_N$ and $\tilde{H}_N$ is the permutation of the rows, namely, there is a row shuffling matrix $B_N$ such that $\tilde{H}_N = B_N H_N$. It was proved in [20] that $B_N$ and $H_N$ commute, which implies that $\tilde{H}_N X_1^N = H_N B_N X_1^N$. However, notice that $X_1^N$ is an i.i.d. sequence and $B_N X_1^N$ is again an i.i.d. sequence with the same distribution as $X_1^N$. In particular, adding or removing $B_N$ does not change the RID values, which implies that for $Z_1^N = H_N X_1^N$ and $I_n(i) = d(Z_i \mid Z_1^{i-1})$, $I_n(i) = \tilde{I}_n(i)$. Therefore, $I$ is also an erasure process with initial value $d(X)$, which polarizes to $\{0, 1\}$. ■
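The erasure recursion of the proof can be compared against the rank characterization directly for small block lengths. The following sketch is only an illustration: the function names are hypothetical, the shuffled Hadamard matrix is built with the row-doubling rule of the proof (row $\tilde{h}_i$ becomes $[\tilde{h}_i, \tilde{h}_i]$ and $[\tilde{h}_i, -\tilde{h}_i]$), and the brute-force expectation enumerates all $2^N$ patterns, so it is usable only for very small $N$.

```python
import itertools
import numpy as np


def erasure_values(delta, n):
    """Values I_n(1..N), N = 2**n, of the erasure process started at delta."""
    vals = [float(delta)]
    for _ in range(n):
        nxt = []
        for v in vals:
            nxt.append(2 * v - v * v)  # "+" child, row 2i - 1
            nxt.append(v * v)          # "-" child, row 2i
        vals = nxt
    return np.array(vals)


def shuffled_hadamard(n):
    """Shuffled Hadamard matrix of order 2**n built by the row-doubling rule."""
    H = np.array([[1.0]])
    for _ in range(n):
        out = np.empty((2 * H.shape[0], 2 * H.shape[1]))
        out[0::2] = np.hstack([H, H])    # rows [h_i, h_i]
        out[1::2] = np.hstack([H, -H])   # rows [h_i, -h_i]
        H = out
    return H


def expected_rank_increments(H, delta):
    """E{rank increase of the first i-1 rows (restricted to the random column
    set C) when row i is added}, where each column is kept with prob delta."""
    N = H.shape[0]
    inc = np.zeros(N)
    for theta in itertools.product([0, 1], repeat=N):
        cols = [j for j, t in enumerate(theta) if t]
        prob = delta ** len(cols) * (1 - delta) ** (N - len(cols))
        prev = 0
        for i in range(N):
            sub = H[: i + 1, cols]
            r = 0 if sub.size == 0 else int(np.linalg.matrix_rank(sub))
            inc[i] += prob * (r - prev)
            prev = r
    return inc
```

For $N = 4$ the two computations can be compared entry by entry, and the average of `erasure_values(delta, n)` stays equal to `delta` at every stage, which is the martingale property behind the polarization argument.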

Using a similar technique, we can prove Theorem 4. The main idea is that $(X, Y)$ are correlated random variables in the space $\mathcal{L}$ and they can be written as a linear combination of i.i.d. nonsingular random variables.

Proof of Theorem 4: For the initial value, we have $I_0(1) = d(X_1)$ and $J_0(1) = d(Y_1|X_1)$. As $\{(X_i, Y_i)\}_{i=1}^N$ is a memoryless source, similar to the single terminal case, it is easy to see that $I$ is an erasure process with initial value $d(X_1)$ and it remains to show that $J$ is also an erasure process but with initial value $d(Y_1|X_1)$.

Let $\tilde{H}^{i-1}$, $\tilde{H}^{i}$ and $\tilde{h}_i$ denote the first $i - 1$ rows, the first $i$ rows, and the $i$-th row of $\tilde{H}_N$. As $X_1, Y_1 \in \mathcal{L}$, there is a sequence of i.i.d. random variables $E_1^k$ and two vectors $a_1^k$ and $b_1^k$ such that $X_1 = \sum_{i=1}^k a_i E_i$ and $Y_1 = \sum_{i=1}^k b_i E_i$. As $\{(X_i, Y_i)\}_{i=1}^N$ is memoryless, there is a concatenation of a sequence of i.i.d. copies of $E_1^k$, $E = \{E_1^k(1), E_1^k(2), \dots, E_1^k(N)\}$, such that

$$\tilde{Z}_1^N = \tilde{H}_N X_1^N = [(B_N H_N) \otimes (a_1^k)^t] E,$$
$$\tilde{W}_1^N = \tilde{H}_N Y_1^N = [(B_N H_N) \otimes (b_1^k)^t] E,$$

where $\otimes$ denotes the Kronecker product and $(a_1^k)^t$, $(b_1^k)^t$ are the transposes of the column vectors $a_1^k$ and $b_1^k$. Let

$$\Theta = \{\theta_1, \theta_2, \dots, \theta_N\} \quad (8)$$

be the random element corresponding to the pattern of $E_1^k(j)$, $j \in [N]$, where $\theta_j \in \{0, 1\}^k$, $j \in [N]$. Using the rank result developed for the RID, it is easy to see that for every $j \in [N]$

$$\tilde{J}_n(j) = d(\tilde{W}_j \mid \tilde{W}_1^{j-1}, \tilde{Z}_1^N)$$
$$= E\{I([\tilde{H}^{j-1} \otimes (b_1^k)^t; \tilde{H}_N \otimes (a_1^k)^t]; \tilde{h}_j \otimes (b_1^k)^t)[C_\Theta]\}.$$

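For $i \in [N]$, let $1_i(\Theta_1^N) \in \{0, 1\}$ denote the random increase of the rank of $[\tilde{H}^{i-1} \otimes (b_1^k)^t; \tilde{H}_N \otimes (a_1^k)^t]_{C_\Theta}$ by adding $\tilde{h}_i \otimes (b_1^k)^t$. Now, consider the stage $n + 1$, where we are going to combine two copies of $\tilde{H}_N$ to construct the matrix $\tilde{H}_{2N}$. The row $i$ corresponding to $\tilde{W}_i$ is split into two new rows $i^+$ and $i^-$, which correspond to the row number $2i - 1$ and the row number $2i$ of $\tilde{H}_{2N}$:

$$\begin{pmatrix}
\tilde{H}_N \otimes (a_1^k)^t & \tilde{H}_N \otimes (a_1^k)^t \\
\tilde{H}_N \otimes (a_1^k)^t & -\tilde{H}_N \otimes (a_1^k)^t \\
\vdots & \vdots \\
\tilde{h}_{i-1} \otimes (b_1^k)^t & \tilde{h}_{i-1} \otimes (b_1^k)^t \\
\tilde{h}_{i-1} \otimes (b_1^k)^t & -\tilde{h}_{i-1} \otimes (b_1^k)^t \\
\tilde{h}_i \otimes (b_1^k)^t & \tilde{h}_i \otimes (b_1^k)^t
\end{pmatrix}$$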



Similar to the single terminal case, we see that adding $\tilde{h}_i \otimes (b_1^k)^t$ increases the rank of the matrix if it increases the rank of either the first or the second block. In other words,

$$1_{2i-1}(\Theta_1^{2N}) = 1_i(\Theta_1^N) + 1_i(\Theta_{N+1}^{2N}) - 1_i(\Theta_1^N)\, 1_i(\Theta_{N+1}^{2N}),$$

where $1_i(\Theta_1^N), 1_i(\Theta_{N+1}^{2N}) \in \{0, 1\}$ are the corresponding amounts of increase of the rank of the first and second block by adding the $i$-th row. In particular, $\Theta_1^N$ and $\Theta_{N+1}^{2N}$ are i.i.d., and so are $1_i(\Theta_1^N)$ and $1_i(\Theta_{N+1}^{2N})$. Taking the expectation, as in the single terminal case, we obtain

$$\tilde{J}_n(i)^+ = 2\tilde{J}_n(i) - \tilde{J}_n(i)^2. \quad (9)$$

Moreover, one can also show that for $i \in [N]$,

$$\frac{\tilde{J}_n(i)^+ + \tilde{J}_n(i)^-}{2} = \tilde{J}_n(i),$$

which together with (9), implies that $\tilde{J}_n(i)^- = \tilde{J}_n(i)^2$. Therefore, $\tilde{J}$ is also an erasure process with initial value $d(Y|X)$. Similar to the single terminal case, one can also show that the permutation matrix $B_N$ is not necessary, thus the proof is complete. ■

C. Single terminal A2A compression 

In this part, we will overview the techniques used to prove the achievability part. The converse part, given in Theorem 5, is proved in Appendix A. We will give separate constructions for the fully discrete case and the mixture case, although the proof techniques used are very similar.



Achievability proof for the mixture case: We will give an explicit construction of the measurement ensemble as follows. Let $n \in \mathbb{N}$ and let $N = 2^n$. Assume that $X_1^N$ is a sequence of i.i.d. nonsingular random variables with RID equal to $d(X)$. Let $Z_1^N = H_N X_1^N$, where $H_N$ is the Hadamard matrix of order $N$. Also assume that $I_n(i) = d(Z_i \mid Z_1^{i-1})$, $i \in [N]$. As we proved in Theorem 3, $I$ is an erasure process with initial value $d(X)$. We will construct the measurement matrix $\Phi_N$ by selecting all of the rows of $H_N$ with the corresponding $I_n$ value greater than $\epsilon d(X)$. Therefore, we can construct the measurement ensemble $\{\Phi_N\}$ labelled with all $N$ that are a power of 2. Assume that the dimension of $\Phi_N$ is $m_N \times N$. It remains to prove that the ensemble $\{\Phi_N\}$ is ε-REP with measurement rate $d(X)$. This will complete the proof of Theorem 7.

Proof of Theorem 7: We first show that the family $\{\Phi_N\}$ has measurement rate $d(X)$. Notice that the process $I_n$ converges almost surely. Thus, it also converges in probability. Specifically, considering the uniform probability assumption, this implies that

$$\limsup_{N \to \infty} \frac{m_N}{N} = \limsup_{N \to \infty} \frac{\#\{i \in [N] : I_n(i) > \epsilon d(X)\}}{N} = \limsup_{n \to \infty} P(I_n > \epsilon d(X)) = P(I_\infty > \epsilon d(X)) = d(X).$$

It remains to prove that $\{\Phi_N\}$ is ε-REP. Let $S = \{i \in [N] : I_n(i) > \epsilon d(X)\}$ denote the selected rows to construct $\Phi_N$ and let $Z_1^N = H_N X_1^N$ be the full measurements. It is easy to check that $\Phi_N X_1^N = Z_S$. Also let $B_i = S^c \cap [i - 1]$ denote all of the indices in $S^c$ before $i$. We have

$$d(X_1^N \mid Z_S) = d(Z_1^N \mid Z_S) = d(Z_{S^c} \mid Z_S)$$
$$= \sum_{i \in S^c} d(Z_i \mid Z_{B_i}, Z_S)$$
$$\le \sum_{i \in S^c} d(Z_i \mid Z_1^{i-1})$$
$$= \sum_{i \in S^c} I_n(i) \le N \epsilon d(X) = \epsilon d(X_1^N),$$

which shows the ε-REP property for $\{\Phi_N\}$. ■
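The row selection used in this proof can be sketched numerically. The code below is a minimal illustration, assuming the erasure recursion of Theorem 3 for the values $I_n(i)$; the function names and the choice of parameters are hypothetical. It returns the selected index set $S$, the empirical rate $m_N/N$, and the residual $\sum_{i \in S^c} I_n(i)$ that the proof bounds by $N \epsilon d(X)$.

```python
import numpy as np


def erasure_values(delta, n):
    """Values I_n(1..N), N = 2**n, of the erasure process started at delta."""
    vals = np.array([float(delta)])
    for _ in range(n):
        nxt = np.empty(2 * vals.size)
        nxt[0::2] = 2 * vals - vals ** 2  # "+" children
        nxt[1::2] = vals ** 2             # "-" children
        vals = nxt
    return vals


def select_rows(delta, eps, n):
    """Select the rows with I_n(i) > eps * delta.

    Returns (S, rate, residual): the selected 0-based indices, the
    empirical measurement rate m_N / N, and the sum of I_n over S^c.
    """
    I = erasure_values(delta, n)
    keep = I > eps * delta
    S = np.flatnonzero(keep)
    residual = float(I[~keep].sum())
    return S, S.size / I.size, residual


S, rate, residual = select_rows(delta=0.5, eps=0.05, n=9)
```

Since every unselected index contributes at most $\epsilon d(X)$, the residual is at most $N \epsilon d(X)$ for any $N$, and since $I_n(i) \le 1$, the empirical rate is at least $(1 - \epsilon) d(X)$; it approaches $d(X)$ only as $N$ grows.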

Achievability proof for the discrete case: For the discrete case, the construction of the measurement family is very similar to the mixture case, with the only difference that instead of using the erasure process corresponding to the RID, we use the discrete entropy function. More exactly, in the discrete case, assuming that $Z_1^N = H_N X_1^N$, we define the following process for $i \in [N]$: $I_n(i) = H(Z_i \mid Z_1^{i-1})$. In [15], using the conditional EPI result [16], the following was proved.

Lemma 1 ("Absorption phenomenon"). $(I_n, \mathcal{F}_n, \mathbb{P})$ is a positive martingale converging to 0 almost surely.

Similar to the mixture case, we again construct the family $\{\Phi_N\}$ by selecting those rows of the shuffled Hadamard matrix with $I_n$ value greater than $\epsilon H(X_1)$.



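Proof of Theorem 6: By a similar procedure, it is easy to show that $\{\Phi_N\}$ has zero measurement rate:

$$\limsup_{N \to \infty} \frac{m_N}{N} = \limsup_{n \to \infty} P(I_n > \epsilon H(X_1)) \le P(\limsup_{n \to \infty} I_n \ge \epsilon H(X_1)) = P(I_\infty \ge \epsilon H(X_1)) = 0.$$

Moreover, assuming that $S = \{i \in [N] : I_n(i) > \epsilon H(X_1)\}$ and $B_i = S^c \cap [i - 1]$, we have

$$H(X_1^N \mid Z_S) = H(Z_1^N \mid Z_S) = H(Z_{S^c} \mid Z_S)$$
$$= \sum_{i \in S^c} H(Z_i \mid Z_{B_i}, Z_S)$$
$$\le \sum_{i \in S^c} H(Z_i \mid Z_1^{i-1})$$
$$\le N \epsilon H(X_1) = \epsilon H(X_1^N),$$

which shows the ε-REP property for $\{\Phi_N\}$. ■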

The last step is to prove Theorem 8, namely, to show that for a family of mixture distributions $\Pi$ with strictly positive RID, there is a fixed measurement family $\{\Phi_N\}$ which is ε-REP for all of the distributions in $\Pi$ with a measurement rate vector lying in the Renyi information region of the family.

Proof of Theorem 8: The proof is simple considering the fact that the construction of the family $\{\Phi_N\}$ in the proof of Theorem 7 depends only on the erasure pattern. Also, the erasure pattern is independent of the shape of the distribution and only depends on its RID. Moreover, it can be shown that the erasure patterns for different values of $\delta$ are embedded in one another, namely, for $\delta > \delta'$, $I_n^{\delta}(i) \ge I_n^{\delta'}(i)$. Considering the method we use to construct the family $\{\Phi_N\}$, this implies that an ε-REP measurement family designed for a specific RID $\delta$ is ε-REP for any distribution with RID less than $\delta$. Thus, if we design $\{\Phi_N\}$ for $\sup_{\pi \in \Pi} d(\pi)$, it will be ε-REP for any distribution in the family. ■

Figure 1 shows the absorption phenomenon for a binary random variable with $P(1) = p = 0.05$. Figure 2 shows the polarization of the RID for a random variable with RID 0.5.

Fig. 1: Absorption pattern for N = 512, p = 0.05

Fig. 2: Polarization of the RID for N = 512, d(X) = 0.5



D. Multi terminal A2A compression

In this section, we will give a brief overview of the techniques used to prove the achievability part. The proof of the converse part is given in Appendix B.

Achievability proof for the mixture case: The proof technique is very similar to the single terminal case. We will define the suitable erasure process and we will use it to construct the desired ε-REP measurement matrices for the multi terminal case. Let $\{(X_i, Y_i)\}_{i=1}^N$ be a two-terminal memoryless source, where $N$ is a power of two. Let $Z_1^N = H_N X_1^N$ and $W_1^N = H_N Y_1^N$. For $i \in [N]$, let us define $I_n(i) = d(Z_i \mid Z_1^{i-1})$ and $J_n(i) = d(W_i \mid W_1^{i-1}, Z_1^N)$. Using Theorem 4, we can show that $I_n$ and $J_n$ are erasure processes with initial values $d(X)$ and $d(Y|X)$, polarizing to $\{0, 1\}$.

The next step is to construct the two terminal measurement ensemble. Let $n \in \mathbb{N}$ and $N = 2^n$. We will construct $\Phi^x_N$ by selecting those rows of the Hadamard matrix $H_N$ with $I_n(i) > \epsilon d(X)$. Similarly, $\Phi^y_N$ is constructed by selecting those rows of $H_N$ with $J_n(i) > \epsilon d(Y|X)$. It remains to prove that the family $\{\Phi^x_N, \Phi^y_N\}$, labeled with $N$, a power of 2, and of dimension $m^x_N \times N$ and $m^y_N \times N$, is ε-REP with measurement rate $(d(X), d(Y|X))$. By this construction, we can achieve one of the corner points of the dominant face of the rate region. If we switch the roles of $X$ and $Y$, we will get the other corner point $(d(X|Y), d(Y))$. One way to obtain any point on the dominant face is to use time sharing between the two families. However, it is also possible to use an explicit construction proposed in [22], which directly gives any point on the dominant face of the measurement rate region without any need for time sharing. We will just prove the achievability for the corner point $(d(X), d(Y|X))$.

Proof of Theorem 11: We first show that the family $\{\Phi^x_N, \Phi^y_N\}$ has measurement rate $(d(X), d(Y|X))$. Notice that the processes $I_n$, $J_n$ converge almost surely; thus, they converge in probability. Specifically, considering the uniform probability assumption and using a similar technique as we used in the single terminal case, we get the following:

$$\limsup_{N \to \infty} \frac{m^x_N}{N} = \limsup_{N \to \infty} \frac{\#\{i \in [N] : I_n(i) > \epsilon d(X)\}}{N} = \limsup_{n \to \infty} P(I_n > \epsilon d(X)) = P(I_\infty > \epsilon d(X)) = d(X).$$

Similarly, we can show that $\limsup_{N \to \infty} m^y_N / N = d(Y|X)$.

It remains to prove that $\{\Phi^x_N, \Phi^y_N\}$ is ε-REP. Let $S_X = \{i \in [N] : I_n(i) > \epsilon d(X)\}$ and $S_Y = \{i \in [N] : J_n(i) > \epsilon d(Y|X)\}$ denote the selected rows to construct $\{\Phi^x_N, \Phi^y_N\}$ and let $Z_1^N = H_N X_1^N$ and $W_1^N = H_N Y_1^N$ be the full measurements for the $x$ and the $y$ terminal. Let $B_i^X = S_X^c \cap [1 : i - 1]$ and $B_i^Y = S_Y^c \cap [1 : i - 1]$ be the sets of all indices in $S_X^c$ and $S_Y^c$ less than $i$. We have

$$d(X_1^N, Y_1^N \mid Z_{S_X}, W_{S_Y}) = d(Z_1^N, W_1^N \mid Z_{S_X}, W_{S_Y})$$
$$\le d(Z_1^N \mid Z_{S_X}) + d(W_1^N \mid Z_1^N, W_{S_Y})$$
$$\le \sum_{i \in S_X^c} d(Z_i \mid Z_{B_i^X}, Z_{S_X}) + \sum_{i \in S_Y^c} d(W_i \mid W_{B_i^Y}, W_{S_Y}, Z_1^N)$$
$$\le N \epsilon d(X) + N \epsilon d(Y|X)$$
$$= \epsilon N d(X, Y) = \epsilon d(X_1^N, Y_1^N),$$

which shows the ε-REP property for the two terminal measurement family $\{\Phi^x_N, \Phi^y_N\}$. ■
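The two-terminal construction for the corner point $(d(X), d(Y|X))$ can be sketched in the same way as the single terminal one. This is an illustration under stated assumptions: $I_n$ and $J_n$ are taken to follow the erasure recursion with initial values $d(X)$ and $d(Y|X)$, and the function names and numerical values are hypothetical.

```python
import numpy as np


def erasure_values(delta, n):
    """Values of the erasure process started at delta, for N = 2**n rows."""
    vals = np.array([float(delta)])
    for _ in range(n):
        nxt = np.empty(2 * vals.size)
        nxt[0::2] = 2 * vals - vals ** 2  # "+" children
        nxt[1::2] = vals ** 2             # "-" children
        vals = nxt
    return vals


def two_terminal_selection(d_x, d_y_given_x, eps, n):
    """Row sets for the x and y terminals at the corner (d(X), d(Y|X)).

    Returns (S_x, S_y, rates, residual), where residual is the sum of the
    unselected I_n and J_n values, i.e. the quantity bounded in the proof.
    """
    I = erasure_values(d_x, n)
    J = erasure_values(d_y_given_x, n)
    keep_x = I > eps * d_x
    keep_y = J > eps * d_y_given_x
    residual = float(I[~keep_x].sum() + J[~keep_y].sum())
    rates = (keep_x.mean(), keep_y.mean())
    return np.flatnonzero(keep_x), np.flatnonzero(keep_y), rates, residual


S_x, S_y, rates, residual = two_terminal_selection(0.5, 0.2, eps=0.05, n=10)
```

Swapping the roles of the two terminals, that is, starting the processes at $d(Y)$ and $d(X|Y)$, gives the selection for the other corner point.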

Achievability proof for the discrete case: In the fully discrete case, the construction is very similar to the mixture case, with the only difference that instead of using the RID, we will use the entropy. Similar to the single terminal case, we can prove the following.

Lemma 2. $(I_n, \mathcal{F}_n)$ and $(J_n, \mathcal{F}_n)$ are positive martingales converging to 0 almost surely.

We again construct the family $\{\Phi^x_N, \Phi^y_N\}$ by selecting those rows of $H_N$ with $I_n > \epsilon H(X)$ and $J_n > \epsilon H(Y|X)$.

Proof of Theorem 10: Similar to the single terminal case, it is easy to show that $\{\Phi^x_N, \Phi^y_N\}$ has measurement rate $(0, 0)$.

It remains to prove that $\{\Phi^x_N, \Phi^y_N\}$ is ε-REP. Let $S_X = \{i \in [N] : I_n(i) > \epsilon H(X)\}$ and $S_Y = \{i \in [N] : J_n(i) > \epsilon H(Y|X)\}$ denote the selected rows to construct $\{\Phi^x_N, \Phi^y_N\}$ and let $Z_1^N = H_N X_1^N$ and $W_1^N = H_N Y_1^N$ be the full measurements for the $X$ and the $Y$ terminal. Let $B_i^X = S_X^c \cap [1 : i - 1]$ and $B_i^Y = S_Y^c \cap [1 : i - 1]$ be the sets of all indices in $S_X^c$ and $S_Y^c$ less than $i$. We have the following:

$$H(X_1^N, Y_1^N \mid Z_{S_X}, W_{S_Y}) = H(Z_1^N, W_1^N \mid Z_{S_X}, W_{S_Y})$$
$$\le H(Z_1^N \mid Z_{S_X}) + H(W_1^N \mid Z_1^N, W_{S_Y})$$
$$\le \sum_{i \in S_X^c} H(Z_i \mid Z_{B_i^X}, Z_{S_X}) + \sum_{i \in S_Y^c} H(W_i \mid W_{B_i^Y}, W_{S_Y}, Z_1^N)$$
$$\le \sum_{i \in S_X^c} H(Z_i \mid Z_1^{i-1}) + \sum_{i \in S_Y^c} H(W_i \mid W_1^{i-1}, Z_1^N)$$
$$\le N \epsilon H(X) + N \epsilon H(Y|X)$$
$$= \epsilon N H(X, Y) = \epsilon H(X_1^N, Y_1^N),$$

which shows the ε-REP property for the two terminal measurement family $\{\Phi^x_N, \Phi^y_N\}$. ■
The last step is to prove Theorem 12, namely, to show that for a family of mixture distributions $\Pi$, there is a fixed measurement family $\{\Phi^x_N, \Phi^y_N\}$ which is ε-REP for all of the distributions in $\Pi$ with a measurement rate in the Renyi information region of the family.

Proof of Theorem 12: The proof is simple considering the fact that the construction of the family $\{\Phi^x_N, \Phi^y_N\}$ in the proof of Theorem 11 depends only on the erasure pattern, which is independent of the shape of the distribution and only depends on its RID. This implies that for any $(\rho_x, \rho_y)$ in the Renyi information region of $\Pi$, the designed measurement family $\{\Phi^x_N, \Phi^y_N\}$ is ε-REP($\Pi$). ■

V. Numerical simulations 

Up to now, we defined the notion of ε-REP for an ensemble of measurement matrices. This definition is what we call an "informational" characterization, in the sense that taking measurements by the ensemble potentially keeps more than a $1 - \epsilon$ fraction of the information of the source. Now, we can ask the natural question of whether this has some "operational" implication, in the sense that after having the linear measurements, is it possible to recover the source up to an acceptable distortion? In particular, is there a computationally feasible algorithm to do that?

To explain the operational view further, let us give an example from polar codes for binary source compression, which has many similarities with what we have done. As shown in [20], for a binary memoryless source with $P(0) = p$, for a large block length $n$, there is a matrix $G_n$, of dimension approximately equal to $n h_2(p) \times n$, such that the linear measurement of the source by this matrix over $\mathbb{F}_2$ faithfully captures all of the randomness of the source. This on its own only solves the encoding part of the problem without directly addressing the decoding part, namely, it does not imply the existence of a decoder to recover the source from the measurements up to negligible distortion (error probability). Therefore, the operational picture is not complete yet. Fortunately, in the case of polar codes, the successive cancellation decoder (or other decoders proposed) fills the gap and shows that the informational characterization implies the operational characterization.
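To make the row count in the polar coding example concrete, the binary entropy function $h_2(p)$ and the approximate number of rows $n h_2(p)$ can be computed as follows. This is a small sketch; the helper names `h2` and `approx_rows` are hypothetical.

```python
import math


def h2(p):
    """Binary entropy function in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def approx_rows(n, p):
    """Approximate number of rows n * h2(p) of the compression matrix."""
    return math.ceil(n * h2(p))
```

For a strongly biased source the matrix is much shorter than the block length, while for $p = 0.5$ no compression is possible and all $n$ rows are needed.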

For simulations, we use a unit variance sparse distribution $p_X(x) = (1 - \delta)\delta_0(x) + \delta p_c(x)$, where $\delta_0(x)$ is the unit delta measure at point zero, $p_c$ is the distribution of the continuous part and $\delta \in \{0.0, 0.1, \dots, 0.9, 1.0\}$ is the RID of the signal. We use the MSE (mean square error) as distortion measure. The simulations are done with the Hadamard matrix of order $N = 512$. To build the measurement matrix $A$, we select the rows of $H_N$ with the highest conditional RID, as stated in Subsection IV-C, until we get acceptable recovery distortion. Figure 3 shows the phase transition (PT) diagram for the $\ell_1$-minimization algorithm. The simulations are done with 3 different distributions for $p_c$: Gaussian, Laplacian and Uniform. The acceptable recovery distortion is set to 0.01. The recovery is successful for the measurement rates above the plotted curves. The results show the insensitivity of the PT region to the distribution of the continuous components.
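The sparse source used in the simulations can be sampled as in the sketch below. It rests on assumptions that the text does not spell out: the continuous part is taken to be zero mean with variance $1/\delta$, so that the mixture has unit variance, and the function name `sample_sparse` and its arguments are hypothetical.

```python
import numpy as np


def sample_sparse(delta, size, kind="gaussian", rng=None):
    """Draw i.i.d. samples from (1 - delta) * delta_0 + delta * p_c.

    Assumption: p_c is zero mean with variance 1 / delta, so that the
    mixture has unit variance. For delta == 0 the signal is identically 0.
    """
    rng = np.random.default_rng() if rng is None else rng
    if delta == 0:
        return np.zeros(size)
    var_c = 1.0 / delta
    if kind == "gaussian":
        c = rng.normal(0.0, np.sqrt(var_c), size)
    elif kind == "laplacian":
        c = rng.laplace(0.0, np.sqrt(var_c / 2.0), size)   # variance 2 * b**2
    elif kind == "uniform":
        a = np.sqrt(3.0 * var_c)                           # variance a**2 / 3
        c = rng.uniform(-a, a, size)
    else:
        raise ValueError(f"unknown continuous part: {kind}")
    active = rng.random(size) < delta                      # continuous mode w.p. delta
    return np.where(active, c, 0.0)
```

The fraction of non-zero entries concentrates around $\delta$, which is the RID of the signal, regardless of which continuous part is chosen.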




Fig. 3: PT diagram for $\ell_1$-minimization




Fig. 4: PT diagram for AMP and $\ell_1$-minimization

We also used the AMP algorithm to recover the signal, where for simplicity, we only did the simulations for the Gaussian case for $p_c$. The AMP iteration is as follows:

$$z_t = y - A x_t + \frac{1}{\gamma}\, z_{t-1} \langle \eta'_{t-1}(A^* z_{t-1} + x_{t-1}) \rangle,$$
$$x_{t+1} = \eta_t(A^* z_t + x_t),$$

where $y = Ax$ is the vector of linear measurements taken by $A$, $\gamma$ is the measurement rate, $\langle a \rangle = \sum_{i=1}^n a_i / n$, $\eta_t(u) = (\eta_{t,1}(u_1), \dots, \eta_{t,N}(u_N))$, and where $\eta_{t,i}(u_i) = E\{X \mid u_i = X + \tau_t N\}$, with $N \sim \mathcal{N}(0, 1)$ independent of the signal $X$ and $\tau_t$ given by the state evolution equation for AMP, is the soft-thresholding function designed for the known distribution of $X$. For initialization, we use $x_0 = 0$ and $z_0 = 0$. Figure 4 compares the PT diagram for AMP and $\ell_1$-minimization. Although AMP, with the thresholding function $\eta_t$ designed for the known distribution of the signal, performs better than $\ell_1$-minimization, there is still a gap with the optimal line.
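The AMP iteration above can be sketched as follows for the Gaussian choice of $p_c$. Several choices here are assumptions made for illustration and are not taken from the paper: the effective noise level $\tau_t$ is estimated from the residual as $\|z_t\| / \sqrt{m}$ instead of being computed from state evolution, the continuous part has variance $1/\delta$, and the demo uses an i.i.d. Gaussian measurement matrix with roughly unit-norm columns rather than a partial Hadamard matrix. The names `denoise`, `amp` and `demo` are hypothetical.

```python
import numpy as np


def denoise(u, tau, delta):
    """Conditional mean E[X | X + tau * G = u] and its derivative in u, for
    X ~ (1 - delta) * delta_0 + delta * N(0, 1 / delta), G ~ N(0, 1)."""
    sigma2 = 1.0 / delta
    s2 = sigma2 + tau ** 2
    slope = 1.0 / tau ** 2 - 1.0 / s2
    # Log-odds that the continuous component is active given the observation.
    log_odds = (np.log(delta / (1.0 - delta)) + 0.5 * np.log(tau ** 2 / s2)
                + 0.5 * slope * u ** 2)
    post = 1.0 / (1.0 + np.exp(-np.clip(log_odds, -500.0, 500.0)))
    gain = sigma2 / s2
    eta = gain * post * u
    deta = gain * (post + post * (1.0 - post) * slope * u ** 2)
    return eta, deta


def amp(y, A, delta, iters=30):
    """AMP with the conditional-mean denoiser; requires 0 < delta < 1."""
    m, n = A.shape
    gamma = m / n                      # measurement rate
    x = np.zeros(n)
    z = np.zeros(m)
    onsager = 0.0
    for _ in range(iters):
        z = y - A @ x + onsager * z    # z on the right-hand side is z_{t-1}
        tau = max(np.linalg.norm(z) / np.sqrt(m), 1e-6)  # empirical noise level
        u = A.T @ z + x
        x, deta = denoise(u, tau, delta)
        onsager = deta.mean() / gamma  # (1 / gamma) * <eta'_t(.)>
    return x


def demo(seed=0, n=512, m=256, delta=0.1):
    """Recover one sparse signal and return (estimate, MSE)."""
    rng = np.random.default_rng(seed)
    active = rng.random(n) < delta
    x = np.where(active, rng.normal(0.0, np.sqrt(1.0 / delta), n), 0.0)
    A = rng.normal(0.0, 1.0 / np.sqrt(m), (m, n))
    x_hat = amp(A @ x, A, delta)
    return x_hat, float(np.mean((x_hat - x) ** 2))
```

Estimating $\tau_t$ from the residual keeps the sketch short; replacing it with the state evolution recursion would follow the description in the text more closely.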

Acknowledgment 

S. Haghighatshoar acknowledges Mr. Adel Javanmard for 
his helpful comments about the AMP algorithm. E. Abbe 
would like to thank Sergio Verdu for stimulating discussions 
on the Renyi information dimension. 

References 

[1] E. Candes, J. Romberg, and T. Tao, "Robust uncertainty principles: Exact
signal reconstruction from highly incomplete frequency information," 
IEEE Transactions on Information Theory, vol. 52, no. 2, pp. 489-509, 
Feb. 2006. 

[2] D.L. Donoho, "Compressed sensing," IEEE Transactions on Information 
Theory, vol. 52, no. 4, pp. 1289-1306, Apr. 2006. 

[3] S. Kudekar and H.D. Pfister, "The effect of spatial coupling on compres- 
sive sensing," In Proc. 48th Annual Allerton Conference, 2010, pp. 347- 
353. 

[4] F. Zhang and H.D. Pfister, "Verification decoding of high-rate LDPC 
codes with applications in compressed sensing," IEEE Transactions on 
Inform. Theory, vol. 58, pp. 5042-5058, Aug. 2012. 

[5] A.G. Dimakis, R. Smarandache, P. Vontobel, "LDPC Codes for Com- 
pressed Sensing," IEEE Transactions on Information Theory, To appear. 

[6] M. Akcakaya and V. Tarokh, "Shannon-theoretic limits on noisy com- 
pressive sampling", IEEE Transactions on Information Theory, vol. 56, 
no. 1, pp. 492-504, Jan. 2010. 

[7] G. Reeves and M. Gastpar, "Sampling bounds for sparse support recovery 
in the presence of noise," in Proceedings of the 2008 IEEE International 
Symposium on Information Theory, Toronto, Canada, Jul. 2008. 

[8] S. Sarvotham, D. Baron, and R.G. Baraniuk, "Measurements and Bits: 
Compressed Sensing meets Information Theory," Proceedings of the 
44th Allerton Conference on Communication, Control, and Computing, 
Monticello, IL, Sep. 2006. 

[9] M. Wainwright, "Information-theoretic bounds on sparsity recovery in the 
high-dimensional and noisy setting," in Proc. IEEE Int. Symp. Information 
Theory, Nice, France, Jun. 2007. 

[10] W. Wang, M.J. Wainwright, K. Ramchandran, "Information-theoretic 
limits on sparse signal recovery: Dense versus sparse measurement 
matrices," IEEE Transactions on Information Theory, Vol. 56, No. 6, 
pp. 2967-2979, Jun. 2010. 

[11] D. Guo, D. Baron, and S. Shamai (Shitz), "A single-letter characterization of optimal noisy compressed sensing," in Proceedings of the
Forty-seventh Annual Allerton Conference on Communication, Control, 
and Computing, Monticello, IL, Oct. 2009. 

[12] Y. Wu and S. Verdu, "Renyi Information Dimension: Fundamental 
Limits of Almost Lossless Analog Compression," IEEE Transactions on 
Information Theory, vol. 56, no. 8, pp. 3721-3747, Aug. 2010. 

[13] Y. Wu and S. Verdu, "Optimal Phase Transitions in Compressed Sens- 
ing," IEEE Transactions on Information Theory, vol. 58, no. 10, pp. 6241- 
6263, Oct. 2012. 



[14] D.L. Donoho, A. Javanmard, and A. Montanari, "Information- 
theoretically optimal compressed sensing via spatial coupling and ap- 
proximate message passing," submitted to IEEE Transactions Information 
Theory, Dec. 2011. 

[15] S. Haghighatshoar, E. Abbe, E. Telatar, "Adaptive sensing using deter- 
ministic partial Hadamard matrices," IEEE International Symposium on 
Information Theory, pp. 1842-1846, Jul. 2012. 

[16] S. Haghighatshoar, E. Abbe, E. Telatar, "A new entropy power inequality for integer-valued random variables," Jan. 2013, Available: http://arxiv.org/abs/1301.4185

[17] F. Krzakala, M. Mezard, F. Sausset, Y. Sun, and L. Zdeborova, "Statistical physics-based reconstruction in compressed sensing," preprint, Nov. 2011. Available: http://arxiv.org/abs/1109.4424

[18] E. Abbe, "Universal source polarization and sparse recovery," IEEE 
Information Theory Workshop (ITW), Dublin, Aug. 2010. 

[19] E. Arıkan, "Channel polarization: A method for constructing capacity-achieving codes for symmetric binary-input memoryless channels," IEEE Transactions Inform. Theory, vol. IT-55, pp. 3051-3073, Jul. 2009.

[20] E. Ankan, "Source polarization," in Proc. IEEE Int. Symp. Inform. 
Theory, Austin, 2010. 

[21] E. Ankan and E. Telatar, "On the rate of channel polarization," IEEE 
International Symposium on Information Theory, pp. 1493-1495, Jul. 
2009. 

[22] E. Ankan, "Polar coding for the Slepian-Wolf problem based on 
monotone chain rules," IEEE Transactions Inform. Theory, pp. 566 -570, 
Jul. 2012. 

[23] D. Baron, M.F. Duarte, S. Sarvotham, M.B. Wakin, and R.G. Baraniuk, 
"An Information-Theoretic Approach to Distributed Compressed Sens- 
ing," Proceedings of the 43rd Allerton Conference on Communication, 
Control, and Computing, Monticello, IL, Sept. 2005. 

[24] D. Slepian and J.K. Wolf, "Noiseless coding of correlated information
sources," IEEE Transactions Inform. Theory, vol. 19, pp. 471-480, Jul.
1973.

[25] E. J. Candès, T. Tao, "Decoding by linear programming," IEEE
Transactions on Information Theory, vol. 51, pp. 4203-4215, Dec. 2005.

[26] E. J. Candès, T. Tao, "Near-optimal signal recovery from random
projections: universal encoding strategies," IEEE Transactions on
Information Theory, vol. 52, pp. 5406-5425, Dec. 2006.

[27] A. Rényi, "On the dimension and entropy of probability distributions,"
Acta Mathematica Hungarica, vol. 10, no. 1-2, Mar. 1959.

Appendix A
Proof of the converse part for the single terminal

In this section, we prove Theorem 5, which constitutes the converse
part and puts a lower bound on the minimum number of linear
measurements needed to keep the $\epsilon$-REP property. We first prove
the following lemmas, which are used repeatedly in other parts.

Lemma 3. Assume that $\Phi$ is a full-rank matrix of dimension
$m \times n$, for $m \le n$, and $\det(\Phi\Phi^T) = 1$. Then, there exists
$S \subset [n]$, $|S| = m$, such that
$|\det(\Phi_S)| \ge \binom{n}{m}^{-1/2} \ge 2^{-n/2}$.

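Proof: As $m \le n$, from the Cauchy-Binet formula we have

$$1 = \det(\Phi\Phi^T) = \sum_{S \subset [n],\, |S| = m} \det(\Phi_S \Phi_S^T) = \sum_{S \subset [n],\, |S| = m} \det(\Phi_S)^2 .$$

As all of the $\binom{n}{m}$ terms are nonnegative, there must be an
$S \subset [n]$ of size $m$ such that $\det(\Phi_S)^2 \ge \binom{n}{m}^{-1}$,
which implies that $|\det(\Phi_S)| \ge \binom{n}{m}^{-1/2} \ge 2^{-n/2}$. ■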
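The counting argument in this proof can be checked exactly on a small example. The sketch below is only an illustration under stated assumptions: the matrix `phi` and the helper names `det`, `gram` and `minors_squared` are hypothetical, and the normalization $\det(\Phi\Phi^T) = 1$ is replaced by the equivalent scaled statement $\max_S \det(\Phi_S)^2 \ge \det(\Phi\Phi^T)/\binom{n}{m}$.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

def det(mat):
    """Exact determinant by Gaussian elimination over the rationals."""
    a = [[Fraction(x) for x in row] for row in mat]
    n = len(a)
    result = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result

def gram(phi):
    """Return phi * phi^T for a matrix given as a list of rows."""
    return [[sum(x * y for x, y in zip(r1, r2)) for r2 in phi] for r1 in phi]

def minors_squared(phi):
    """Map each column subset S with |S| = m to det(phi_S)^2."""
    m, n = len(phi), len(phi[0])
    return {S: det([[row[j] for j in S] for row in phi]) ** 2
            for S in combinations(range(n), m)}

# Hypothetical 2 x 4 example matrix (not taken from the paper).
phi = [[1, 2, 0, -1],
       [0, 1, 3, 2]]
m, n = len(phi), len(phi[0])

sq = minors_squared(phi)
total = det(gram(phi))   # det(phi phi^T)
best = max(sq.values())  # largest squared m x m minor

print("Cauchy-Binet holds:", sum(sq.values()) == total)
print("best minor^2 >= total / C(n, m):", best * comb(n, m) >= total)
print("best minor^2 >= total / 2^n:", best * 2 ** n >= total)
```

Exact rational arithmetic is used so that the Cauchy-Binet identity can be compared with `==` rather than with a floating-point tolerance.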



Lemma 4. Let $X$ be a continuous random variable with finite
differential entropy and let $D = [X]_q$. Suppose $\Theta$ is a random
element for which the differential entropy and the RID of $X$
given $\Theta$ are well-defined. Then, we have

$$h(qX \mid D, \Theta) \le 0, \qquad \lim_{q \to \infty} \frac{h(qX \mid D, \Theta)}{\log_2(q)} = 0 .$$



Proof: We have

$$h(qX \mid D, \Theta) = h(q(X - D) \mid D, \Theta) \le h(q(X - D)) .$$

We know that $0 \le X - D \le \frac{1}{q}$, which implies that
$q(X - D)$ has a bounded support contained in $[0, 1]$. As the
uniform distribution maximizes the differential entropy for a
fixed support, we have $h(qX \mid D, \Theta) \le h(U[0, 1]) = 0$. We also
have

$$\begin{aligned}
\frac{h(qX \mid D, \Theta)}{\log_2(q)} &= \frac{h(X \mid D, \Theta) + \log_2(q)}{\log_2(q)} \\
&= \frac{h(X \mid \Theta) - I(D; X \mid \Theta) + \log_2(q)}{\log_2(q)} \\
&= \frac{h(X \mid \Theta) - H(D \mid \Theta) + \log_2(q)}{\log_2(q)} \\
&= 1 - \frac{H(D \mid \Theta)}{\log_2(q)} + \frac{h(X \mid \Theta)}{\log_2(q)} .
\end{aligned}$$

Given $\Theta$, $X$ has a well-defined differential entropy, which
implies that, for almost all realizations of $\Theta$, $X$ conditioned
on $\Theta$ is a continuous random variable. Therefore,

$$\lim_{q \to \infty} \frac{H(D \mid \Theta)}{\log_2(q)} = d(X \mid \Theta) = 1 .$$

Taking the limit as $q$ tends to infinity, we get the result. ■
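Both claims of the lemma can be observed numerically for a concrete density. The following sketch is an illustration only and makes several assumptions that are not in the paper: $\Theta$ is trivial, $X$ has the density $2x$ on $[0, 1)$, and $[X]_q = \lfloor qX \rfloor / q$, so that bin $k$ has probability $(2k+1)/q^2$. The function names are hypothetical.

```python
import math

def quantized_entropy_bits(q):
    """H([X]_q) in bits for X with density 2x on [0, 1).

    Bin k = [k/q, (k+1)/q) has probability ((k+1)^2 - k^2) / q^2 = (2k+1) / q^2.
    """
    total = 0.0
    for k in range(q):
        p = (2 * k + 1) / (q * q)
        total -= p * math.log2(p)
    return total

# Differential entropy of the density 2x on [0, 1), in bits: (1/2 - ln 2) / ln 2.
H_DIFF = (0.5 - math.log(2)) / math.log(2)

def scaled_conditional_entropy(q):
    """h(qX | D) = h(X) - H(D) + log2(q), using I(D; X) = H(D) since D = [X]_q."""
    return H_DIFF - quantized_entropy_bits(q) + math.log2(q)

for q in (2, 16, 256, 4096):
    value = scaled_conditional_entropy(q)
    # First column: q. Second: h(qX | D), never positive.
    # Third: the normalized value, which tends to 0 as q grows.
    print(q, value, value / math.log2(q))
```

In this example $h(qX \mid D)$ stays non-positive and shrinks in magnitude as $q$ grows, while $H(D)/\log_2(q)$ approaches $1$ only at the slow rate $1/\log_2(q)$.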

Taking $\Theta$ to be null in Lemma 4, we get the following
corollary.

Corollary 2. Let $X$ be a continuous random variable with
a well-defined differential entropy and let $D = [X]_q$. Then
$h(qX \mid D) \le 0$ and $\lim_{q \to \infty} \frac{h(qX \mid D)}{\log_2(q)} = 0$.

Lemma 5. Let $X_1^n$ be a sequence of i.i.d. continuous random
variables and let $D_1^n = [X_1^n]_q$. Assume that $\Phi$ is a full-rank
matrix of dimension $m \times n$, where $m \le n$ and $\Phi\Phi^T = I_m$.
Suppose $\Theta$ is a random element such that the differential entropy
of $\Phi X_1^n$ given $\Theta$ is well-defined. Then
$h(q\Phi X_1^n \mid D_1^n, \Theta) \doteq 0$.

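Proof: By Lemma 3, there is an $S \subset [n]$ of size $m$ such
that $|\det(\Phi_S)| \ge 2^{-n/2}$. Hence, we have

$$\begin{aligned}
h(q\Phi X_1^n \mid D_1^n, \Theta) &= h(q\Phi_S X_S + q\Phi_{S^c} X_{S^c} \mid D_1^n, \Theta) \\
&\ge h(q\Phi_S X_S + q\Phi_{S^c} X_{S^c} \mid X_{S^c}, D_1^n, \Theta) \\
&= h(q\Phi_S X_S \mid X_{S^c}, D_S, \Theta) \\
&= h(qX_S \mid D_S, X_{S^c}, \Theta) + \log_2(|\det(\Phi_S)|) \\
&\ge \sum_{i \in S} h(qX_i \mid X_{B_i}, X_{S^c}, D_S, \Theta) - \frac{n}{2} \\
&\ge -\sum_{i \in S} \left| h(qX_i \mid X_{B_i}, X_{S^c}, D_S, \Theta) \right| - \frac{n}{2} \\
&= -\sum_{i \in S} \left| h(qX_i \mid D_i, \Theta_i) \right| - \frac{n}{2},
\end{aligned}$$

where $B_i = S \cap [i - 1]$ and $\Theta_i = \{\Theta, X_{B_i}, X_{S^c}, D_{S \setminus \{i\}}\}$.
The final result follows by applying Lemma 4. ■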

Proof of Theorem 5: Without any loss of generality,
we can assume that $\{\Phi_N\}$ is a full-rank family; otherwise, we
can drop some of the rows of $\Phi_N$ and obtain an equivalent
family with lower measurement rate. Also, we can assume
that the rows of $\Phi_N$ are orthonormal. Otherwise, by the
Gram-Schmidt procedure, we can obtain an equivalent family with
orthonormal rows. In other words, there is a lower triangular
and invertible $m_N \times m_N$ matrix $L_N$ such that
$\tilde{\Phi}_N = L_N \Phi_N$ has orthonormal rows. As $L_N$ is
invertible, it results that

$$H(D_1^N \mid \tilde{\Phi}_N X_1^N) = H(D_1^N \mid L_N \Phi_N X_1^N) = H(D_1^N \mid \Phi_N X_1^N) .$$

Thus the equivalent family $\{\tilde{\Phi}_N\}$ is also $\epsilon$-REP and has
orthonormal rows. We therefore assume from now on that
$\Phi_N \Phi_N^T = I_m$, where we again dropped the dependence of $m$ on $N$.
We also represent each $X_i$, $i \in [N]$, as
$X_i = \Theta_i U_i + \bar{\Theta}_i V_i$.
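The orthonormalization step can be sketched numerically. The code below is a minimal illustration, not the authors' construction: the function name `gram_schmidt_rows` and the example matrix are assumptions. It runs Gram-Schmidt on the rows of a full-rank matrix and keeps track of the lower triangular matrix $L$ with $L\Phi$ having orthonormal rows. Since the diagonal of $L$ is strictly positive, $L$ is invertible, so $L\Phi x$ and $\Phi x$ determine each other, which is the fact used in the conditional entropy identity above.

```python
import math

def gram_schmidt_rows(phi):
    """Return (L, Q) with Q = L * phi, Q having orthonormal rows, L lower triangular.

    phi is a list of m linearly independent rows of length n.
    """
    m = len(phi)
    L, Q = [], []
    for i in range(m):
        # coeff expresses the current vector as a combination of the rows of phi.
        coeff = [0.0] * m
        coeff[i] = 1.0
        vec = [float(x) for x in phi[i]]
        for j in range(i):
            # Remove the component along the j-th orthonormal row found so far.
            proj = sum(a * b for a, b in zip(vec, Q[j]))
            vec = [a - proj * b for a, b in zip(vec, Q[j])]
            coeff = [c - proj * l for c, l in zip(coeff, L[j])]
        norm = math.sqrt(sum(a * a for a in vec))
        if norm < 1e-12:
            raise ValueError("rows of phi are linearly dependent")
        Q.append([a / norm for a in vec])
        L.append([c / norm for c in coeff])
    return L, Q

# Hypothetical 3 x 4 full-rank example.
phi = [[1, 1, 0, 0],
       [1, 0, 1, 0],
       [0, 1, 1, 1]]
L, Q = gram_schmidt_rows(phi)

for row in L:
    print([round(v, 4) for v in row])  # lower triangular, positive diagonal
```

Rows that are linearly dependent are rejected here; in the proof they are simply dropped, which only lowers the measurement rate.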

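From the $\epsilon$-REP assumption, for any $\eta > 0$ there is a
$Q_1 \in \mathbb{N}$ such that for $q > Q_1$

$$\frac{I(D_1^N; \Phi_N X_1^N)}{N \log_2(q)} \ge \frac{H(D_1)(1 - \epsilon - \eta)}{\log_2(q)}, \tag{10}$$

where $D_1^N = [X_1^N]_q$. As we are going to take the limit as $q$
tends to infinity, we can drop the negligible terms. In other
words, we have

$$\frac{I(D_1^N; \Phi_N X_1^N)}{N \log_2(q)} \doteq \frac{I(D_1^N, \Theta_1^N; \Phi_N X_1^N)}{N \log_2(q)} \doteq \frac{I(D_1^N; \Phi_N X_1^N \mid \Theta_1^N)}{N \log_2(q)}, \tag{11}$$

where we used the fact that $I(\Theta_1^N; \,\cdot\,) \le N H(\Theta_1) \doteq 0$. For
a specific realization $\theta_1^N$, let $C_\theta = \{i \in [N] : \theta_i = 1\}$ and
$\bar{C}_\theta = [N] \setminus C_\theta$ as introduced before. Then, we obtain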



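$$\begin{aligned}
I(D_1^N; \Phi_N X_1^N \mid \theta_1^N) &= I(D_{C_\theta}, D_{\bar{C}_\theta}; \Phi_{C_\theta} U_{C_\theta} + \Phi_{\bar{C}_\theta} V_{\bar{C}_\theta}) \\
&\doteq I(D_{C_\theta}; \Phi_{C_\theta} U_{C_\theta} + \Phi_{\bar{C}_\theta} V_{\bar{C}_\theta} \mid D_{\bar{C}_\theta}) \\
&\doteq I(D_{C_\theta}; \Phi_{C_\theta} U_{C_\theta} + \Phi_{\bar{C}_\theta} V_{\bar{C}_\theta}, V_{\bar{C}_\theta} \mid D_{\bar{C}_\theta}) \\
&= I(D_{C_\theta}; \Phi_{C_\theta} U_{C_\theta} + \Phi_{\bar{C}_\theta} V_{\bar{C}_\theta} \mid D_{\bar{C}_\theta}, V_{\bar{C}_\theta}) \\
&= I(D_{C_\theta}; \Phi_{C_\theta} U_{C_\theta}),
\end{aligned} \tag{12}$$

where $D_{C_\theta} = [U_{C_\theta}]_q$ and $D_{\bar{C}_\theta} = [V_{\bar{C}_\theta}]_q$ denote the
component-wise quantization of $U_{C_\theta}$ and $V_{\bar{C}_\theta}$. We also used
the fact that

$$H(D_{\bar{C}_\theta}) \le H(V_{\bar{C}_\theta}) \le N H(V_1) \doteq 0 .$$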

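Let $D_u = [U_1]_q$ and $D_v = [V_1]_q$. We consider two cases.
First, if $|C_\theta| \le m$, using (12), we have

$$I(D_{C_\theta}; \Phi_{C_\theta} U_{C_\theta}) \le m H(D_u) \triangleq B_1(\theta_1^N). \tag{13}$$

Second, if $|C_\theta| > m$, generally, $\Phi_{C_\theta}$ is neither full-rank nor
orthonormal. However, we can drop the redundant rows and, by
using the Gram-Schmidt procedure, we can create an equivalent
orthonormal matrix $\Psi$ of dimension $m' \times |C_\theta|$ with
$m' \le m < |C_\theta|$. Therefore, for this case we obtain

$$\begin{aligned}
I(D_{C_\theta}; \Phi_{C_\theta} U_{C_\theta}) &= I(D_{C_\theta}; \Psi U_{C_\theta}) \\
&= h(\Psi U_{C_\theta}) - h(\Psi U_{C_\theta} \mid D_{C_\theta}) \\
&= h(\Psi U_{C_\theta}) - h(q \Psi U_{C_\theta} \mid D_{C_\theta}) + m' \log_2(q) \\
&\mathrel{\dot{\le}} \frac{m'}{2} \log_2(2\pi e \sigma^2) + m' \log_2(q) \\
&\doteq m' \log_2(q) \triangleq B_2(\theta_1^N),
\end{aligned} \tag{14}$$

where $\sigma^2$ is the variance of $U_1$. We also used Lemma 5, which gives
$h(q \Psi U_{C_\theta} \mid D_{C_\theta}) \doteq 0$, and the fact that the Gaussian
distribution maximizes the differential entropy for a given covariance
matrix. Combining (13) and (14), and $m' \le m$, we obtain

$$I(D_1^N; \Phi_N X_1^N \mid \theta_1^N) \mathrel{\dot{\le}} m \max\{\log_2(q), H(D_u)\},$$

which implies that

$$I(D_1^N; \Phi_N X_1^N \mid \Theta_1^N) \mathrel{\dot{\le}} m \max\{\log_2(q), H(D_u)\}. \tag{15}$$

Moreover, from (10), (11) and (15), we get

$$\frac{m}{N} \cdot \frac{\max\{\log_2(q), H(D_u)\}}{\log_2(q)} \mathrel{\dot{\ge}} \frac{H(D_1)(1 - \epsilon - \eta)}{\log_2(q)} .$$

Taking the limit as $q$ tends to infinity, we obtain

$$\frac{m}{N} \ge \delta (1 - \epsilon - \eta),$$

which implies that

$$\limsup_{N \to \infty} \frac{m_N}{N} \ge \delta (1 - \epsilon - \eta) .$$

As $\eta > 0$ is arbitrary, we get the result. ■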
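The last limit relies on $H(D_1)/\log_2(q) \to \delta$ for the mixture $X_1 = \Theta_1 U_1 + \bar{\Theta}_1 V_1$. The sketch below illustrates this convergence for one hypothetical choice that is not taken from the paper: $U_1$ uniform on $[0, 1)$, $V_1 \equiv 0$, and $\Theta_1$ Bernoulli with $P(\Theta_1 = 1) = \delta$. With $[x]_q = \lfloor qx \rfloor / q$, the quantized variable puts mass $(1 - \delta) + \delta/q$ on $0$ and mass $\delta/q$ on each of the other $q - 1$ bins.

```python
import math

def mixture_quantized_entropy(delta, q):
    """H([X]_q) in bits for X = Theta * U, with U ~ Uniform[0, 1) and
    P(Theta = 1) = delta (the discrete part is a point mass at 0)."""
    p0 = (1 - delta) + delta / q   # bin containing 0
    p_other = delta / q            # each of the remaining q - 1 bins
    h = -p0 * math.log2(p0)
    if q > 1 and p_other > 0:
        h -= (q - 1) * p_other * math.log2(p_other)
    return h

def rid_estimate(delta, q):
    """Finite-q proxy H([X]_q) / log2(q) for the information dimension."""
    return mixture_quantized_entropy(delta, q) / math.log2(q)

delta = 0.3
for k in (10, 50, 200):
    # The gap to delta decays like 1 / log2(q) = 1 / k.
    print(k, rid_estimate(delta, 2 ** k))
```

The slow $1/\log_2(q)$ convergence is why the proof works with terms that are negligible relative to $\log_2(q)$ rather than with exact equalities.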



Appendix B 

Proof of the converse part for the multi terminal 

This section is devoted to the proof of Theorem 9. This theorem
puts constraints on the number of linear measurements that
we should take from the different terminals in order to keep the
$\epsilon$-REP property.

Proof of Theorem 9: From the $\epsilon$-REP property, we have

$$I([X_1^N]_q, [Y_1^N]_q; \Phi_N^x X_1^N, \Phi_N^y Y_1^N) \mathrel{\dot{\ge}} (1 - \epsilon) H([X_1^N]_q, [Y_1^N]_q). \tag{16}$$

Similar to the $\Gamma$ notation that we used in (8) for the
representation of $X_1^N$ and $Y_1^N$, we have

$$I([X_1^N]_q, [Y_1^N]_q; \Phi_N^x X_1^N, \Phi_N^y Y_1^N) \doteq I([X_1^N]_q, [Y_1^N]_q; \Phi_N^x X_1^N, \Phi_N^y Y_1^N \mid \Gamma_1^N) .$$

As $\Gamma_1^N$ takes finitely many values, we can obtain the result
for a specific realization $\gamma_1^N$ and then take the expectation over
all possible realizations. For a specific realization $\gamma_1^N$, if some
of the components of $\Phi_N^x X_1^N$ and $\Phi_N^y Y_1^N$ are discrete or
linearly dependent, we can drop them. With some
abuse of notation, let $(\Phi_N^x X_1^N, \Phi_N^y Y_1^N)$ denote the remaining
components, for which the matrices have dimensions $r_N^x \times N$ and
$r_N^y \times N$, where $r_N^x \le m_N^x$ and $r_N^y \le m_N^y$ depend on the
specific realization $\gamma_1^N$.

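$$\begin{aligned}
I_\gamma([X_1^N]_q, [Y_1^N]_q; \Phi_N^x X_1^N, \Phi_N^y Y_1^N)
&= h_\gamma(\Phi_N^x X_1^N, \Phi_N^y Y_1^N) - h_\gamma(\Phi_N^x X_1^N, \Phi_N^y Y_1^N \mid [X_1^N]_q, [Y_1^N]_q) \\
&\mathrel{\dot{\le}} -h_\gamma(\Phi_N^x X_1^N, \Phi_N^y Y_1^N \mid [X_1^N]_q, [Y_1^N]_q) && (17) \\
&= -h_\gamma(q\Phi_N^x X_1^N, q\Phi_N^y Y_1^N \mid [X_1^N]_q, [Y_1^N]_q) + (r_N^x + r_N^y) \log_2(q) \\
&\mathrel{\dot{\le}} (m_N^x + m_N^y) \log_2(q), && (18)
\end{aligned}$$

where in (17), we used the fact that $h_\gamma(\Phi_N^x X_1^N, \Phi_N^y Y_1^N)$ is
upper bounded by the differential entropy of a Gaussian random
vector with appropriate covariance matrix, which is negligible
in the limit as $q$ tends to infinity. Also, in (18), we used Lemma 5
to show that $h_\gamma(q\Phi_N^x X_1^N, q\Phi_N^y Y_1^N \mid [X_1^N]_q, [Y_1^N]_q) \doteq 0$.
Therefore, taking the expectation over $\Gamma_1^N$, we obtain that

$$I([X_1^N]_q, [Y_1^N]_q; \Phi_N^x X_1^N, \Phi_N^y Y_1^N) \mathrel{\dot{\le}} (m_N^x + m_N^y) \log_2(q) .$$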

We also have $H([X_1^N]_q, [Y_1^N]_q) \doteq N d(X, Y) \log_2(q)$. Therefore,
using (16) and taking the limit as $q$ tends to infinity, we
obtain

$$\frac{m_N^x}{N} + \frac{m_N^y}{N} \ge (1 - \epsilon)\, d(X, Y),$$

and hence $\rho_x + \rho_y \ge (1 - \epsilon)\, d(X, Y)$.

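To prove the other two inequalities, notice that

$$\begin{aligned}
I_\gamma([X_1^N]_q, [Y_1^N]_q; \Phi_N^x X_1^N, \Phi_N^y Y_1^N)
&= I_\gamma([Y_1^N]_q; \Phi_N^x X_1^N, \Phi_N^y Y_1^N) + I_\gamma([X_1^N]_q; \Phi_N^x X_1^N, \Phi_N^y Y_1^N \mid [Y_1^N]_q) \\
&\le H_\gamma([Y_1^N]_q) + I_\gamma([X_1^N]_q; \Phi_N^x X_1^N \mid [Y_1^N]_q) \\
&\quad + I_\gamma([X_1^N]_q; \Phi_N^y Y_1^N \mid \Phi_N^x X_1^N, [Y_1^N]_q).
\end{aligned} \tag{19}$$

For the last term, $I_\gamma([X_1^N]_q; \Phi_N^y Y_1^N \mid \Phi_N^x X_1^N, [Y_1^N]_q)$, we can
again assume that we have dropped all of the discrete and linearly
dependent terms from $\Phi_N^y Y_1^N$ so that it has a well-defined
differential entropy. Thus, we obtain

$$\begin{aligned}
I_\gamma([X_1^N]_q; \Phi_N^y Y_1^N \mid \Phi_N^x X_1^N, [Y_1^N]_q)
&= h_\gamma(q\Phi_N^y Y_1^N \mid \Phi_N^x X_1^N, [Y_1^N]_q) && (20) \\
&\quad - h_\gamma(q\Phi_N^y Y_1^N \mid \Phi_N^x X_1^N, [Y_1^N]_q, [X_1^N]_q). && (21)
\end{aligned}$$

Notice that, for the first term in (20),

$$h_\gamma(q\Phi_N^y Y_1^N \mid \Phi_N^x X_1^N, [Y_1^N]_q) = h_\gamma(q\Phi_N^y (Y_1^N - [Y_1^N]_q) \mid \Phi_N^x X_1^N, [Y_1^N]_q) \le h_\gamma(q\Phi_N^y (Y_1^N - [Y_1^N]_q)) .$$

It is easy to see that the random vector $q(Y_1^N - [Y_1^N]_q) \in [0, 1]^N$
has a bounded support independent of $q$; thus
$h_\gamma(q\Phi_N^y Y_1^N \mid \Phi_N^x X_1^N, [Y_1^N]_q)$ has an upper bound independent
of $q$. Therefore,

$$h_\gamma(q\Phi_N^y Y_1^N \mid \Phi_N^x X_1^N, [Y_1^N]_q) \mathrel{\dot{\le}} 0 .$$

Using a similar argument for (21), we have

$$h_\gamma(q\Phi_N^y Y_1^N \mid \Phi_N^x X_1^N, [Y_1^N]_q, [X_1^N]_q) \mathrel{\dot{\le}} 0 .$$

Assume $L$ is a lower triangular invertible matrix, obtained
through the Gram-Schmidt procedure, such that $L\Phi_N^y$ is an
orthonormal matrix. Then, applying Lemma 5, we obtain that

$$\begin{aligned}
h_\gamma(q\Phi_N^y Y_1^N \mid \Phi_N^x X_1^N, [Y_1^N]_q) &\mathrel{\dot{\ge}} -\log_2(|\det(L)|) \mathrel{\dot{\ge}} 0, \\
h_\gamma(q\Phi_N^y Y_1^N \mid \Phi_N^x X_1^N, [Y_1^N]_q, [X_1^N]_q) &\mathrel{\dot{\ge}} -\log_2(|\det(L)|) \mathrel{\dot{\ge}} 0 .
\end{aligned}$$

This implies that $I_\gamma([X_1^N]_q; \Phi_N^y Y_1^N \mid \Phi_N^x X_1^N, [Y_1^N]_q) \doteq 0$.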
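Thus, from (19), we obtain

$$I_\gamma([X_1^N]_q, [Y_1^N]_q; \Phi_N^x X_1^N, \Phi_N^y Y_1^N) \mathrel{\dot{\le}} H_\gamma([Y_1^N]_q) + I_\gamma([X_1^N]_q; \Phi_N^x X_1^N \mid [Y_1^N]_q) .$$

Again, if $\Phi_N^x X_1^N$ has discrete components, or if some of the
components are linearly dependent or can be predicted from
$[Y_1^N]_q$, we can drop them. With some abuse of notation, let
$\Phi_N^x X_1^N$ denote the resulting random vector of dimension
$r_N^x \le m_N^x$. We have

$$\begin{aligned}
I_\gamma([X_1^N]_q; \Phi_N^x X_1^N \mid [Y_1^N]_q)
&= h_\gamma(\Phi_N^x X_1^N \mid [Y_1^N]_q) - h_\gamma(\Phi_N^x X_1^N \mid [Y_1^N]_q, [X_1^N]_q) \\
&\mathrel{\dot{\le}} -h_\gamma(\Phi_N^x X_1^N \mid [Y_1^N]_q, [X_1^N]_q) \\
&= -h_\gamma(q\Phi_N^x X_1^N \mid [Y_1^N]_q, [X_1^N]_q) + r_N^x \log_2(q) \\
&\doteq r_N^x \log_2(q) \le m_N^x \log_2(q),
\end{aligned} \tag{22}$$

where we used the fact that $h_\gamma(\Phi_N^x X_1^N \mid [Y_1^N]_q) \doteq 0$ and, from
Lemma 5, $h_\gamma(q\Phi_N^x X_1^N \mid [Y_1^N]_q, [X_1^N]_q) \doteq 0$. Therefore, taking
the expectation over $\Gamma_1^N$ and using (16) and (22), we obtain

$$\begin{aligned}
m_N^x \log_2(q) &\mathrel{\dot{\ge}} (1 - \epsilon) H([X_1^N]_q, [Y_1^N]_q \mid \Gamma_1^N) - H([Y_1^N]_q \mid \Gamma_1^N) \\
&\doteq (1 - \epsilon) H([X_1^N]_q, [Y_1^N]_q) - H([Y_1^N]_q),
\end{aligned}$$

which implies that $\frac{m_N^x}{N} \ge d(X \mid Y) - \epsilon\, d(X, Y)$. Therefore,
taking the limit as $N$ tends to infinity, we get

$$\rho_x = \limsup_{N \to \infty} \frac{m_N^x}{N} \ge d(X \mid Y) - \epsilon\, d(X, Y) .$$

The last inequality in the theorem follows by symmetry. ■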
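The normalization behind the final bound on $\rho_x$ can be written out as a short worked step. This is a sketch under two assumptions that the proof uses implicitly and that are not restated in this appendix: that $H([Y_1^N]_q) \doteq N\, d(Y) \log_2(q)$, by analogy with the joint statement above, and that the chain rule $d(X, Y) = d(Y) + d(X \mid Y)$ holds for the source class considered.

```latex
\begin{align*}
m_N^x \log_2(q)
  &\mathrel{\dot{\ge}} (1-\epsilon)\, H([X_1^N]_q, [Y_1^N]_q) - H([Y_1^N]_q) \\
  &\doteq (1-\epsilon)\, N\, d(X,Y) \log_2(q) - N\, d(Y) \log_2(q) \\
  &= N \bigl( d(X,Y) - d(Y) - \epsilon\, d(X,Y) \bigr) \log_2(q) \\
  &= N \bigl( d(X \mid Y) - \epsilon\, d(X,Y) \bigr) \log_2(q).
\end{align*}
```

Dividing by $N \log_2(q)$ and letting $q$ tend to infinity gives $m_N^x / N \ge d(X \mid Y) - \epsilon\, d(X, Y)$, which is the inequality used before taking the limit in $N$.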



